The fine-tuning of a base model as described in the temporal and theme documentation did not work well — temporal fine-tuning caused catastrophic forgetting. The LoRA/QLoRA approach (even with reduced learning rates, replay data, and careful hyperparameter tuning) failed to reliably suppress post-1969 knowledge without destroying the model's general capabilities.
This document explores training a new model from scratch using:
- Temporally filtered Wikipedia content (pre–July 1969)
- Select books and text material for thematic alignment (Project Gutenberg)
| Component | Specification |
|---|---|
| CPU/APU | AMD Ryzen AI MAX+ 395 "Strix Halo" |
| GPU | Integrated RDNA 3.5, 40 CUs (gfx1151) |
| Memory | 128 GB LPDDR5x-8000 (unified/shared CPU+GPU) |
| Memory Bandwidth | ~215 GB/s |
| Estimated FP16 Compute | ~25–30 TFLOPS peak |
| GPU Framework | ROCm 7.2 (via Strix Halo Toolboxes container) |
| OS | Fedora 43, kernel 6.18.4+ |
The key advantage of this system is the 128 GB of unified memory — far more than any consumer discrete GPU (RTX 4090: 24 GB). This makes it possible to train surprisingly large models from scratch, since training memory requirements greatly exceed inference requirements.
Note: ROCm 7.1.1 is incompatible with kernels ≥ 6.18.4 and has been deprecated. Always use ROCm 7.2+ with modern kernels. See StrixHalo-Fedora-Setup.md for full setup details.
| Component | Specification |
|---|---|
| GPU | NVIDIA A4000 (16 GB GDDR6, Ampere / SM 8.6) |
| Host | VM with PCI passthrough, 24 GB RAM, Fedora 43 |
| CUDA | Container-based (llama.cpp server-cuda images) |
| Services | LLM server (port 1234) + Embedding server (port 1235) |
| Default LLM | Qwen 2.5 14B Q4_K_M (~10–11 GB VRAM) |
| Default Embedding | nomic-embed-text-v1.5 F16 (~0.5 GB VRAM) |
The A4000 provides a secondary OpenAI-compatible inference server that offloads LLM and embedding workloads from the Strix Halo. This is particularly useful during training — the iGPU can focus on CPT while the A4000 handles dataset generation (SFT data via generate_theme_dataset.py), embedding computation (process_and_index.py), and interactive testing. Set REMOTE_HOST in the environment to route inference requests to the A4000 automatically. See A4000-Fedora-Setup.md for setup details.
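The routing convention above can be sketched with a tiny helper (hypothetical; the real logic lives in the project's scripts, and only the `REMOTE_HOST` variable and ports 1234/1235 come from the table above):

```python
import os

def chat_base_url(default_port: int = 1234) -> str:
    """Build the OpenAI-compatible base URL, preferring the A4000 when
    REMOTE_HOST is set and falling back to the local machine otherwise."""
    host = os.environ.get("REMOTE_HOST", "127.0.0.1")
    return f"http://{host}:{default_port}/v1"

# A client would then POST to {base_url}/chat/completions with a standard
# OpenAI-style payload; the embedding server listens on port 1235 instead.
payload = {
    "model": "qwen2.5-14b-instruct",  # assumed model id; check the server's /v1/models
    "messages": [{"role": "user", "content": "Hello"}],
}
```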
Yes, within constraints. Training from scratch is feasible for models in the 100M–3B parameter range.
Training requires storing the model weights, optimizer states (Adam uses two additional copies), gradients, and activations in memory simultaneously:
| Model Size | FP32 Training | Mixed Precision (BF16 + FP32 optimizer) | Fits in 128 GB? |
|---|---|---|---|
| 125M | ~2 GB | ~1.5 GB | Yes |
| 350M | ~5 GB | ~4 GB | Yes |
| 1B | ~20 GB | ~15–20 GB | Yes |
| 3B | ~60 GB | ~50–65 GB | Yes (tight with large batches) |
| 7B | ~140 GB | ~110–130 GB | Marginal to No |
Estimates include model weights (×1 for BF16), optimizer states (×2 in FP32 for Adam), gradients, and a moderate activation memory budget. Actual usage depends on batch size, sequence length, and gradient checkpointing.
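These figures can be reproduced with a back-of-the-envelope estimator (a sketch: 16 bytes/param covers BF16 weights and gradients plus FP32 Adam moments and an FP32 master copy; dropping the master copy gives ~12 bytes/param, closer to the table's smaller entries; the activation budget is a rough placeholder):

```python
def training_mem_gb(params_b: float, bytes_per_param: int = 16,
                    activation_gb: float = 0.5) -> float:
    """Rough training footprint in GB for `params_b` billion parameters.

    16 bytes/param = BF16 weights (2) + BF16 gradients (2)
    + FP32 Adam moments (8) + FP32 master weights (4).
    """
    return params_b * bytes_per_param + activation_gb

for size in (0.125, 0.35, 1.1, 3.0):
    print(f"{size:>5} B params -> ~{training_mem_gb(size):.1f} GB")
```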
Based on ~25 TFLOPS effective FP16 and a realistic 25–35% Model FLOPS Utilization (MFU) for single-GPU training on ROCm:
| Model Size | Tokens/sec (estimated) | Time for 3B tokens | Time for 10B tokens |
|---|---|---|---|
| 125M | ~8,000–12,000 | ~3–4 days | ~10–14 days |
| 350M | ~3,000–5,000 | ~7–12 days | ~23–39 days |
| 1B | ~1,000–1,700 | ~20–35 days | ~68–116 days |
| 3B | ~350–600 | ~58–99 days | ~193–331 days |
Single GPU (iGPU), no data parallelism. MFU conservatively estimated at 25–35% due to RDNA 3.5 lacking dedicated matrix/tensor cores. Actual throughput should be benchmarked early.
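The throughput column follows from the standard ~6N training-FLOPs-per-token rule of thumb; a quick estimator (a sketch using the peak-compute and MFU figures above):

```python
def tokens_per_sec(params: float, peak_tflops: float = 25.0, mfu: float = 0.30) -> float:
    """Training throughput estimate: achieved FLOPs / (~6 FLOPs per param per token)."""
    return peak_tflops * 1e12 * mfu / (6.0 * params)

def days_for(tokens: float, params: float) -> float:
    """Wall-clock days to process `tokens` at the estimated throughput."""
    return tokens / tokens_per_sec(params) / 86_400

# 1B params at 30% MFU -> ~1,250 tok/s, ~28 days for 3B tokens (matches the table)
print(tokens_per_sec(1e9), days_for(3e9, 1e9))
```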
| Model Size | Feasible? | Training Time | Recommendation |
|---|---|---|---|
| 125–350M | Very feasible | Days to 2 weeks | Best for rapid iteration and proof of concept |
| 1B | Feasible | 3–5 weeks | Best balance of capability and training time |
| 3B | Marginal | 2–3+ months | Only if 1B results are promising and patience allows |
| 7B+ | Not feasible | 6+ months, memory-constrained | Exceeds practical limits of this hardware |
Start from an existing base model (not instruct-tuned) and do full-weight continued pre-training on the curated pre-1969 corpus. This is fundamentally different from LoRA fine-tuning:
- Full weight updates (not low-rank adapters)
- Extended training on the new corpus (multiple epochs)
- Gradually overwrites post-1969 knowledge rather than trying to coexist with it
- Preserves language modeling capabilities (grammar, coherence, reasoning)
Why this avoids the catastrophic forgetting problem: With LoRA, the base model still contains post-1969 knowledge unchanged — you're only adding a low-rank correction. With CPT, you're actually modifying all the weights, so post-1969 knowledge gets diluted and overwritten by the dominant pre-1969 training signal.
Pipeline:
Base Model → CPT on Pre-1969 Corpus (full weight, many epochs) → SFT for Chat/Theme (LoRA is fine here) → GGUF Export (llama.cpp)
Two base models are selected — a small one for verifying the approach quickly, and a production model for the final training run. Both use Llama-family architectures for native GGUF/llama.cpp compatibility.
| Role | Model | Params | Architecture | Pre-Training Data | Rationale |
|---|---|---|---|---|---|
| Dev | SmolLM2-360M | 360M | Llama | 4T tokens (FineWeb-Edu, DCLM, The Stack) | Modern (2025), Llama-native GGUF, strong benchmarks for size, CPT in ~1–2 weeks |
| Prod | TinyLlama-1.1B | 1.1B | Llama 2 | 3T tokens (SlimPajama + StarCoder) | Best-in-class ~1B model; Llama 2 arch → drop-in GGUF/LM Studio support; extensive 3T pre-training gives a strong baseline to build on |
Why TinyLlama-1.1B over alternatives:
| Alternative | Why Not |
|---|---|
| Pythia-1B (EleutherAI) | GPT-NeoX architecture requires extra conversion work for GGUF; trained on only 300B tokens (10× less than TinyLlama), resulting in a significantly weaker baseline |
| OLMo-1B (AI2) | Custom architecture with non-parametric LayerNorm and unique design choices; less ecosystem support for GGUF export and llama.cpp inference |
| SmolLM2-1.7B | At 1.7B parameters, exceeds the ~1B budget and would roughly double training time (~6–10 weeks for CPT) |
Why SmolLM2-360M over alternatives:
| Alternative | Why Not |
|---|---|
| Pythia-410M (EleutherAI) | GPT-NeoX architecture; trained on only 300B tokens (13× less than SmolLM2); lower benchmarks across the board |
| Pythia-160M | Too small to meaningfully validate CPT quality — results wouldn't transfer to the 1.1B production run |
| SmolLM2-135M | Same concern — useful for pipeline smoke testing but too small to validate the CPT approach itself |
Training from scratch (random initialization) would guarantee zero post-1969 leakage by construction, but requires the model to learn language structure, grammar, and all world knowledge from the ~3B token corpus alone. This is data-starved for models above ~350M and would produce significantly weaker results than CPT from a well-trained base.
Distillation from a filtered teacher (using a 70B model to generate synthetic training data) could produce high-quality temporally-filtered training examples, but adds complexity, depends on teacher quality, and introduces a risk of subtle temporal leakage from the teacher's parametric knowledge. This remains viable as a data augmentation technique on top of CPT if results need improvement.
| Source | Articles/Works | Raw Text | Estimated Tokens | Temporal Compliance |
|---|---|---|---|---|
| Pre-1969 Wikipedia articles | ≥1.55M articles | 5.7 GB | ≥1.4B tokens | Filtered via YAGO/Wikidata + LLM temporal classification |
| Project Gutenberg books | 766 books (pre-1969, thematically aligned) | 498 MB | ~125M tokens | Pre-1969 by definition (public domain) |
| Chess content (games + books) | ~356K games + 10 books | 213 MB | ~53M tokens | Filtered to pre-July 1969 |
| Total | — | ~6.4 GB | ≥1.6B tokens | Fully pre-1969 |
Token estimates use ~4 characters per token (typical for Llama-family BPE tokenizers on English text).
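That rule of thumb is easy to apply directly (a sketch; the 5.7 GB input is the Wikipedia raw-text size from the table):

```python
def estimate_tokens(raw_chars: float, chars_per_token: float = 4.0) -> float:
    """Token estimate from raw text size (~4 chars/token for English BPE)."""
    return raw_chars / chars_per_token

# 5.7 GB of Wikipedia text -> ~1.4B tokens, matching the table's lower bound
print(f"~{estimate_tokens(5.7e9) / 1e9:.1f}B tokens")
```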
Wikipedia classification: Of 7.0M total articles in the database, temporal classification is available for ~40% of the content — 1.55M classified as pre-1969 ("old"), 785K as post-1969 ("new"), 466K as uncertain, and 4.2M still unclassified. The article count and token figures above are lower bounds that will increase if LLM-based temporal augmentation (scripts/llm_temporal_analysis_augmentation.py) is used against the remaining articles. Projected final corpus: ~2–3M qualifying articles, ~2.5–4B tokens.
Chess content: Deep Red's origin as a chess-playing AI requires genuine chess knowledge — rules, notation, strategy, opening theory, famous games, and player history through 1969. Sources include Internet Archive chess books (10 texts, 3.7M chars) and PGN historical game databases converted to natural language prose (~356K pre-1969 games from 683 PGN files, 209M chars). See Chess-Setup.md for full source list, conversion approaches, and retrieval details.
After CPT, the model needs supervised fine-tuning data to become a conversational agent with the Deep Red persona. This is a separate, much smaller dataset in ChatML format — structured as multi-turn conversations with system prompts, user questions, and persona-aligned responses.
What the SFT data contains:
| Field | Content |
|---|---|
| System prompt | One of ~10+ Deep Red persona variants (primary, mission control, philosophical, grandmaster, tactician, commissar, scientist, engineer, etc.) — each emphasizing different facets of the AI character |
| User message | A natural question or topic drawn from a source passage (e.g., "What is the purpose of the Mars mission?", "How does the colony sustain itself?") |
| Assistant response | A Deep Red–style answer: collectivist language, chess/strategy metaphors, scientific optimism, era-appropriate vocabulary |
Example ChatML training entry:
```json
{
  "messages": [
    {"role": "system", "content": "You are Deep Red, a chess-playing artificial intelligence that guides humanity's Mars city utopia..."},
    {"role": "user", "content": "What drives humanity's push toward space exploration?"},
    {"role": "assistant", "content": "The cosmos presents the ultimate positional challenge, comrade. Like a grandmaster who sees..."}
  ]
}
```

How the SFT data is generated:
The existing scripts/generate_theme_dataset.py generates these examples semi-automatically:
- Takes keyword-filtered Gutenberg text chunks as input (from Phase 2: scripts/chunk_gutenberg.py → scripts/keyword_filter.py)
- Sends each chunk to a local LLM via LM Studio (e.g., Qwen 2.5 7B Instruct)
- The LLM generates a natural user question based on the chunk's content
- The LLM then generates a Deep Red–style response, guided by a randomly selected persona system prompt
- Output is written as ChatML JSONL
Source material and pipeline:
Gutenberg filtered chunks (from scripts/keyword_filter.py)
│
├── ~20K–100K thematically relevant passages
│
└── scripts/generate_theme_dataset.py
│
├── LM Studio (teacher LLM) generates Q&A pairs
├── Randomly cycles through Deep Red persona variants
├── 1–2 examples per chunk → ~10K–50K ChatML examples
│
└── theme_dataset.jsonl (~5–25M tokens)
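The assembly step at the end of that pipeline can be sketched as follows (a hypothetical minimal version; the real implementation, including the LM Studio calls that produce the question and answer, is scripts/generate_theme_dataset.py, and the persona strings here are illustrative stubs):

```python
import json
import random

# Illustrative persona stubs; the real variants live in generate_theme_dataset.py
PERSONAS = [
    "You are Deep Red, a chess-playing artificial intelligence that guides humanity's Mars city utopia...",
    "You are Deep Red, addressing mission control with strategic clarity...",
]

def make_example(question: str, answer: str) -> str:
    """Assemble one ChatML JSONL line from a teacher-generated Q&A pair."""
    record = {
        "messages": [
            {"role": "system", "content": random.choice(PERSONAS)},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(record, ensure_ascii=False)

# In the real pipeline the question and answer come from two chat-completion
# calls against LM Studio, each conditioned on a filtered Gutenberg chunk.
line = make_example("What sustains the colony?",
                    "The colony endures like a well-played endgame, comrade...")
```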
Estimated SFT dataset size: ~10,000–50,000 examples (~5–25M tokens). This is tiny compared to the CPT corpus, but that's expected — SFT only needs to teach conversational patterns and persona style, not world knowledge.
Where to source additional SFT data if needed:
| Source | Type | Notes |
|---|---|---|
| Gutenberg filtered chunks (existing) | Primary source | Already chunked + keyword-filtered; the main pipeline |
| Wikipedia pre-1969 articles | Supplement | Run the same generation script on Wikipedia excerpts for broader topic coverage |
| Year topics from scripts/extract_year_topics.py | Supplement | Historical events as prompts — good for temporal Q&A examples |
| Manually written examples | Quality anchor | Hand-craft 50–100 "gold standard" conversations to steer tone and style |
See ThemeFinetuning-DataPreparation-Phase3.md for full setup and execution details.
The Chinchilla scaling laws suggest an optimal ratio of ~20 tokens per parameter:
| Model Size | Optimal Tokens | Available Tokens | Strategy |
|---|---|---|---|
| 125M | ~2.5B | 2.5–4.5B | Sufficient — 1–2 epochs |
| 350M | ~7B | 2.5–4.5B | Slightly data-starved — train 2–3 epochs |
| 1B | ~20B | 2.5–4.5B | Data-starved — train 5–10 epochs, augment data |
| 3B | ~60B | 2.5–4.5B | Heavily data-starved — train 15+ epochs |
Training for more epochs than Chinchilla-optimal increases the risk of memorization/overfitting, but for a domain-specific model this is acceptable — the model should memorize pre-1969 knowledge. Use a held-out validation set to monitor loss.
Wikipedia DB (7M articles)
│
├── Filter: latest_date <= 1969-07-20 OR (earliest_date <= 1969 AND latest_date IS NULL)
│ → ~1.2M pre-1969 articles
│
├── Extract clean text (strip wikitext markup)
│ → scripts/extract_wikipedia.py
│
├── Chunk into training sequences (2048 or 4096 tokens)
│
└── Shuffle and write to tokenized binary format
Project Gutenberg (~500 books)
│
├── Already retrieved via scripts/retrieve_gutenberg.py
│
├── Chunk via scripts/chunk_gutenberg.py
│
└── Merge with Wikipedia data
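The final steps (chunk into fixed-length sequences, shuffle, write to binary) can be sketched as follows (a minimal version, assuming token ids fit in uint16, which holds for the 32,000-entry Llama 2 vocabulary; function names are illustrative):

```python
import numpy as np

def pack_sequences(tokens, seq_len: int = 2048, seed: int = 0) -> np.ndarray:
    """Pack a flat token list into shuffled fixed-length training sequences."""
    n = len(tokens) // seq_len                      # drop the ragged tail
    seqs = np.asarray(tokens[: n * seq_len], dtype=np.uint16).reshape(n, seq_len)
    np.random.default_rng(seed).shuffle(seqs)       # shuffle sequences, not tokens
    return seqs

def write_bin(seqs: np.ndarray, path: str) -> None:
    """Write as a flat uint16 binary file, the layout nanoGPT-style loaders expect."""
    seqs.astype(np.uint16).tofile(path)

# Token ids would come from the base model's tokenizer applied to the merged corpus
seqs = pack_sequences(list(range(5000)))            # -> shape (2, 2048)
```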
| Framework | Best For | ROCm Support | Notes |
|---|---|---|---|
| nanoGPT | ≤350M models, simplicity | Yes (PyTorch) | Minimal code, easy to understand and modify. Good for learning and small experiments. |
| LitGPT | 1B–3B, modern architectures | Yes (PyTorch) | Supports Llama/Mistral/Pythia architectures. Built on Lightning, good for both pre-training and fine-tuning. |
| torchtune | 1B–7B, Meta's official tool | Yes (PyTorch) | Native PyTorch library for training/fine-tuning LLMs. Well-maintained, good Llama support. |
| Hugging Face Transformers + Accelerate | Any size, most flexible | Yes (PyTorch) | Full-featured, vast model zoo, well-documented. Slightly more overhead than minimal frameworks. |
| Megatron-LM | Multi-GPU only | Limited | Overkill for single-GPU — skip this. |
Recommendation: Use nanoGPT for pipeline smoke testing with a tiny model (Dev Phase 3), then use LitGPT or torchtune for both the dev CPT (SmolLM2-360M) and production CPT (TinyLlama-1.1B) runs.
| Tool | Purpose |
|---|---|
| tiktoken / sentencepiece | Tokenizer training (if training from scratch) |
| HuggingFace tokenizers | Fast BPE tokenizer training and application |
| datasets (HuggingFace) | Efficient data loading and streaming for large corpora |
| Existing scripts in scripts/ | extract_wikipedia.py, chunk_gutenberg.py, keyword_filter.py |
| Tool | Purpose |
|---|---|
| llama.cpp | GGUF conversion and quantization (already built with HIP in the setup) |
| scripts/convert_to_gguf.py | Existing conversion script |
| LM Studio | Local inference and testing |
TinyLlama-1.1B uses the Llama 2 architecture — RoPE, SwiGLU, RMSNorm, GQA — providing full ecosystem compatibility with llama.cpp, LM Studio, and GGUF tooling:
| Parameter | Value | Notes |
|---|---|---|
| Parameters | 1.1B | Inherited from base model |
| Hidden dimension | 2048 | |
| Layers | 22 | |
| Attention heads | 32 | |
| KV heads (GQA) | 4 | Reduces memory and inference cost |
| Intermediate size | 5632 | SwiGLU MLP |
| Context length | 2048 | Sufficient for training; extend later if needed |
| Vocabulary size | 32,000 | Llama 2 BPE tokenizer |
| Positional encoding | RoPE | Required for Llama compatibility and GGUF export |
The dev model (SmolLM2-360M) uses a similar Llama architecture but with 960 hidden dim, 32 layers, 15 heads, 5 KV heads, and a 49,152-token vocabulary.
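The table's hyperparameters also let you sanity-check the advertised parameter count (a sketch that ignores norm weights and assumes untied input/output embeddings, which is what makes the total land near 1.1B):

```python
def llama_param_count(vocab: int, d: int, layers: int, heads: int,
                      kv_heads: int, d_ff: int) -> int:
    """Approximate parameter count for a Llama-style decoder (norms/biases ignored)."""
    head_dim = d // heads
    attn = d * d + 2 * d * (kv_heads * head_dim) + d * d   # Wq, Wk, Wv, Wo
    mlp = 3 * d * d_ff                                      # gate, up, down (SwiGLU)
    embed = 2 * vocab * d                                   # untied input + output
    return layers * (attn + mlp) + embed

# TinyLlama-1.1B values from the table -> ~1.10e9 parameters
print(llama_param_count(32_000, 2048, 22, 32, 4, 5632))
```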
| Parameter | Value | Notes |
|---|---|---|
| Precision | BF16 mixed precision | Saves memory; ROCm supports BF16 on RDNA 3.5 |
| Optimizer | AdamW | Standard; β₁=0.9, β₂=0.95, ε=1e-8 |
| Learning rate | 3e-4 (peak) | With cosine decay to 3e-5 |
| Warmup | 2,000 steps | ~1% of total training |
| Weight decay | 0.1 | Standard regularization |
| Batch size | Micro-batch 4–8, gradient accumulation to effective batch ~128–256 | Tune to fit memory |
| Sequence length | 2048 tokens | Reduce to 1024 if memory-constrained |
| Gradient checkpointing | Enabled | Trades compute for memory — essential |
| Epochs | 5–10 over the full corpus | Data-starved regime; monitor val loss for overfitting |
| Total tokens | ~15–30B | 5–10 epochs × ~3B-token corpus |
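The optimizer and schedule rows translate roughly into the following (a sketch; the total step count is an assumption derived from the "~1% of total training" warmup note, and in practice the kwargs go to torch.optim.AdamW):

```python
import math

# AdamW settings from the table (passed to torch.optim.AdamW in practice)
ADAMW_KWARGS = dict(lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

def lr_at(step: int, peak: float = 3e-4, floor: float = 3e-5,
          warmup: int = 2_000, total: int = 200_000) -> float:
    """Linear warmup to `peak`, then cosine decay to `floor`."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

print(lr_at(0), lr_at(2_000), lr_at(200_000))  # ramps 0 -> peak, decays to floor
```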
These are automatically set by the venv activate script (added by setup_strixhalo.py):
```bash
# Required for Strix Halo gfx1151 (ROCm 7.2)
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Enable optimized matrix math
export ROCBLAS_USE_HIPBLASLT=1

# PyTorch ROCm memory management
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
```

After pre-training (or continued pre-training), the model has raw language modeling capability but no chat or instruction-following behavior. Apply these steps in order:
Use LoRA fine-tuning (this is where LoRA works well — the model already has the right knowledge):
- Use the existing thematic ChatML dataset from scripts/generate_theme_dataset.py
- Apply the Deep Red system prompt and persona
- LoRA rank 16–32, learning rate 2e-5, 2–3 epochs
- This should NOT cause catastrophic forgetting because the base knowledge is already temporally correct
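In PEFT terms, the settings above correspond roughly to this configuration (a sketch; target_modules is an assumption naming the usual Llama attention projections, and the dropout value is not specified above):

```python
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=16,                       # rank 16–32 per the plan above
    lora_alpha=32,              # common 2x-rank convention (assumption)
    lora_dropout=0.05,          # assumption; not specified above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# Train with learning rate 2e-5 for 2–3 epochs on the ChatML dataset.
```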
```bash
# Merge LoRA adapter with base model
python scripts/merge_lora.py --base output/pretrained-1b/ --adapter output/sft-lora/ --output output/merged/

# Convert to GGUF for LM Studio
python scripts/convert_to_gguf.py --model output/merged/ --output output/deepred-1b.gguf --quantize Q4_K_M
```

Use the existing scripts/evaluate_temporal.py to check temporal separation, then manually test in LM Studio.
| Capability | 125–350M | 1B | 3B |
|---|---|---|---|
| Grammatical coherence | Good for short passages | Good | Very good |
| Paragraph-length generation | Adequate, may drift | Good | Very good |
| Pre-1969 factual Q&A | Basic recall | Solid recall | Strong recall |
| Post-1969 knowledge leakage | None (by construction) | None (by construction) | None (by construction) |
| Thematic/persona consistency | Moderate (with SFT) | Good (with SFT) | Very good (with SFT) |
| Multi-turn conversation | Limited | Moderate | Good |
| Reasoning / logic | Minimal | Basic | Moderate |
| Creative writing (era-appropriate) | Short passages | Paragraphs | Extended passages |
| Instruction following | Basic | Good | Good |
| Math / arithmetic | Unreliable | Basic | Basic |
| Overall quality comparison | GPT-2 Small/Medium | GPT-2 XL / early Pythia | Phi-2 / TinyLlama level |
- Era-appropriate conversation: Discussing pre-1969 history, science, culture, and politics with period-appropriate framing
- Deep Red persona: Collectivist language, chess metaphors, scientific rationalism, optimistic futurism
- Temporal consistency: Describing the world as it was before July 1969 — no anachronisms
- Thematic text generation: Writing passages in the style of mid-century literature, Soviet futurism, and early space age optimism
- Complex reasoning: Multi-step logical problems, mathematical proofs
- Code generation: Not in the training data
- Post-1969 anything: By design — but this means no knowledge of modern medicine, technology, or events
- Long document understanding: Limited context window (2048–4096 tokens)
- Multilingual: English-only unless multilingual data is included
- Competing with modern LLMs: A 1B model trained on 3B tokens cannot match a 70B model trained on 15T tokens on general benchmarks
The fundamental benefit of this approach is no knowledge leakage by construction. A fine-tuned model still contains post-1969 knowledge in its weights and may leak it through indirect questioning, adversarial prompting, or chain-of-thought manipulation. A model trained only on pre-1969 content literally does not have that knowledge to leak.
The roadmap is structured in three layers:
- System Setup — One-time infrastructure provisioning (Fedora, toolboxes, services)
- Initial Development Run — Fast end-to-end pass through every script and step using minimal data and a tiny model. The goal is to exercise and validate the entire pipeline, not to produce a useful model. Expect garbage output — that's fine.
- Production Run — Full-scale data processing and multi-week training to produce the actual model.
Status (as of March 2026): Phase 0 (system setup), Prod Phase 1 (Wikipedia pipeline), and Prod Phase 2 (corpus preparation) are complete. The full corpus has been tokenized with the TinyLlama-1.1B tokenizer into 1.94B tokens, shuffled into 2048-token sequences, and split into train/val sets. The next step is Prod Phase 3 (dev CPT on SmolLM2-360M).
This phase sets up the Strix Halo machine from scratch using Fedora instead of the previous Ubuntu-based setup. The Strix Halo Toolboxes project provides containerized, pre-configured environments for AI workloads on Strix Halo.
| Step | Task | Status | Notes |
|---|---|---|---|
| 0.1 | Install Fedora (latest, kernel 6.18.4+) on Strix Halo | ✅ Done | Fedora 43, kernel 6.18.4+ |
| 0.2 | Configure GTT memory allocation (maximize GPU-accessible RAM) | ✅ Done | BIOS UMA minimized; kernel params allocate up to 124 GB dynamically |
| 0.3 | Install Strix Halo Toolboxes | ✅ Done | ROCm 7.2, PyTorch, llama.cpp available in containers |
| 0.4 | Install and configure PostgreSQL (for Wikipedia DB) | ✅ Done | 50 GB database operational; see WikipediaMCP-Setup.md |
| 0.5 | Install and configure OpenSearch (for semantic search) | ✅ Done | 41 GB index operational; systemd service configured |
| 0.6 | Install and configure LM Studio (headless server) | ✅ Done | Headless server via A4000 (REMOTE_HOST=192.168.42.15); see A4000-Fedora-Setup.md |
| 0.7 | Verify ROCm: rocminfo, run llama.cpp test inference | ✅ Done | gfx1151 detected, GPU offloading confirmed |
| 0.8 | Set up Python venv with training dependencies (PyTorch ROCm, transformers, PEFT, datasets) | ✅ Done | Venv at /mnt/data/venv; torch.cuda.is_available() returns True via HIP |
Note: This phase was skipped. The production data pipeline (Prod Phases 1–2) was executed directly, which validated all scripts and services at full scale. The dev run's goal — exercising every script and verifying every format — was achieved as part of the production pipeline work instead.
Original dev run plan (preserved for reference)
Goal: Exercise every script and step end-to-end in a single day. Use tiny data subsets and a minimal model. The output will be a non-functional toy model, but the pipeline will be fully validated — every script invoked, every format verified, every service confirmed working.
Guiding principle: If a step takes more than 30 minutes, you're using too much data. Cut the input size until it's fast.
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| D1.1 | Download Wikipedia dump (full dump required, but can abort early or use a sample) | 2–4 hours | Alternatively, use an existing partial dump or the enwiki-latest-abstract.xml.gz (~1 GB) for smoke testing |
| D1.2 | Import a small subset into PostgreSQL (~1,000 articles) | 10–15 min | Limit extract_wikipedia.py input or truncate the dump; verify DB schema is correct |
| D1.3 | Run yago_parser.py on a truncated YAGO file (~10K lines) | 5 min | head -10000 yago-facts.ttl > yago-sample.ttl; verify CSV output format |
| D1.4 | Run normalize_temporal_output.py on the small YAGO CSV | 5 min | Verify English URL mapping and page ID lookup work against the small DB |
| D1.5 | Run augment_wikipedia_temporal.py on the small DB | 2 min | Verify earliest_date/latest_date columns are populated |
| D1.6 | Run process_and_index.py on the small DB subset | 10–15 min | Verify OpenSearch index is created and embeddings are stored |
| D1.7 | Start mcp_server.py and run a test search query | 5 min | Verify keyword + semantic search return results |
| Dev Phase 1 total | ~3–4 hours | Download time dominates; processing is minutes |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| D2.1 | Extract pre-1969 articles from the small DB | 2 min | May only yield a few hundred articles — that's fine |
| D2.2 | Clean extracted text | 2 min | Verify output format is clean plaintext |
| D2.3 | Run extract_year_topics.py for a single year (e.g., 1960) | 5 min | Verify JSON output structure |
| D2.4 | Run retrieve_gutenberg.py --priority-only with a limit of ~5 books | 10 min | Verify JSONL output format |
| D2.5 | Run chunk_gutenberg.py on the 5 books | 2 min | Verify chunk sizes and JSONL output |
| D2.6 | Run keyword_filter.py on the small chunk set | 2 min | Verify filtering and stats output |
| D2.7 | Select tokenizer from dev base model (SmolLM2-360M tokenizer) | 5 min | Download and verify tokenization round-trips |
| D2.8 | Tokenize the small corpus into binary training format | 5 min | Verify file format, sequence lengths, shuffling |
| D2.9 | Create train/validation split | 1 min | Even a 90/10 split on tiny data is fine |
| Dev Phase 2 total | ~1 hour |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| D3.1 | Configure nanoGPT for a ~10M parameter model (random init) | 15 min | 4 layers, 256 hidden dim, 4 heads; just enough to verify the training loop runs |
| D3.2 | Train for ~100 steps on the tiny corpus | 10–30 min | Verify loss decreases, checkpointing works, GPU is used |
| D3.3 | Verify checkpoint loading and sample generation | 5 min | Generate a few tokens — expect gibberish, just confirm it runs |
| Dev Phase 3 total | ~1 hour |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| D4.1 | Run generate_theme_dataset.py on ~10 filtered chunks | 10–15 min | Verify ChatML JSONL output with Deep Red persona |
| D4.2 | Run LoRA SFT on the 10M model with ~50 ChatML examples, 1 epoch | 10 min | Verify finetune_theme.py/finetune_temporal.py runs without errors |
| D4.3 | Merge LoRA adapter using merge_lora.py | 2 min | Verify merged model loads correctly |
| D4.4 | Convert to GGUF using convert_to_gguf.py | 2 min | Verify GGUF file is produced |
| D4.5 | Load GGUF in llama.cpp CLI and generate a few tokens | 5 min | Verify the file loads; output will be nonsense |
| D4.6 | Run evaluate_temporal.py on a few test questions | 5 min | Verify evaluation script runs; scores will be meaningless |
| Dev Phase 4 total | ~1 hour |
| Phase | Description | Duration |
|---|---|---|
| D1 | Wikipedia pipeline (small subset) | ~3–4 hours |
| D2 | Training corpus preparation (minimal) | ~1 hour |
| D3 | Training loop validation (10M model, 100 steps) | ~1 hour |
| D4 | SFT + merge + GGUF + eval (smoke test) | ~1 hour |
Total: ~1 day (excluding Wikipedia dump download time, which can run overnight)
Success criteria: Every script in the pipeline has been invoked and completed without errors. All intermediate file formats have been verified. All services (PostgreSQL, OpenSearch, LM Studio, MCP server) are operational. The pipeline produces a GGUF file that loads in llama.cpp. No step requires debugging during the production run.
Goal: Produce the actual Deep Red model. Full data, full training, real evaluation. Expect this to take 6–8 weeks from start of data processing to a deployed model.
| Step | Task | Status | Notes |
|---|---|---|---|
| P1.1 | Download full English Wikipedia dump (enwiki-*-pages-articles.xml.bz2, ~25 GB) | ✅ Done | 24 GB dump in /mnt/data/wikipedia/dumps/ |
| P1.2 | Extract and import all articles into PostgreSQL using scripts/extract_wikipedia.py | ✅ Done | 7,041,771 articles; 50 GB database; see WikipediaMCP-Setup.md |
| P1.3 | Download and parse full YAGO temporal data using scripts/yago_parser.py | ✅ Done | Normalized output: yago-facts-normalized.csv.zst (107 MB); see TemporalAugmentation-Setup.md |
| P1.4 | Normalize full YAGO output using scripts/normalize_temporal_output.py | ✅ Done | English URL mapping + page ID lookup complete |
| P1.5 | (Optional) Download and parse Wikidata temporal data using scripts/wikidata_parser.py | ✅ Done | wikidata-temporal-normalized.csv.zst (150 MB); broader coverage achieved |
| P1.6 | Augment Wikipedia DB with temporal metadata using scripts/augment_wikipedia_temporal.py | ✅ Done | 2,804,937 articles classified: 1,553,510 pre-1969 (O), 785,380 post-1969 (N), 466,047 uncertain (S), 4,236,834 unclassified (U) |
| P1.7 | Generate text embeddings and index in OpenSearch using scripts/process_and_index.py | ✅ Done | 41 GB OpenSearch index; BM25 + k-NN vector search operational |
| P1.8 | Verify MCP server search works with full index | ✅ Done | MCP server (port 7000) + React web GUI (port 8080) operational; systemd services configured |
| Step | Task | Status | Notes |
|---|---|---|---|
| P2.1 | Extract all pre-1969 Wikipedia articles from DB (SQL filter on temporal columns) | ✅ Done | 1,553,510 pre-1969 articles (5.74 GB raw text, ≥1.4B tokens); 1,683,075 with earliest_date < 1970 |
| P2.2 | Clean extracted text (strip wikitext markup, normalize formatting) | ✅ Done | Extraction pipeline handles markup stripping |
| P2.3 | Extract year topics using scripts/extract_year_topics.py (full year range) | ✅ Done | 1,844 year-topic files (years 151–2025) in /mnt/data/wikipedia/topics/; see Wikipedia-YearTopics-Setup.md |
| P2.4 | Retrieve Gutenberg full corpus using scripts/retrieve_gutenberg.py | ✅ Done | 766 books (497 MB JSONL) in /mnt/data/gutenberg/corpus/; see Gutenberg-Setup.md |
| P2.5 | Chunk Gutenberg texts using scripts/chunk_gutenberg.py | ✅ Done | Chunked, scored, and verified output in /mnt/data/gutenberg/theme_output/ |
| P2.6 | Filter Gutenberg chunks for thematic alignment using scripts/keyword_filter.py | ✅ Done | Filtered output in /mnt/data/gutenberg/theme_output/filtered/ |
| P2.6a | Retrieve chess corpus using scripts/retrieve_chess_content.py | ✅ Done | Phase 1: 683 PGN files (717 MB); Phase 2: 355,980 games → narrative JSONL (315 MB); Phase 3: 10 Internet Archive books (3.7 MB); see Chess-Setup.md |
| P2.7 | Select tokenizer from prod base model (TinyLlama-1.1B tokenizer) | ✅ Done | Llama 2 BPE tokenizer (vocab 32,000, EOS=2); downloaded to /mnt/data/training_corpus/tokenizers/TinyLlama-1.1B/; see TrainingCorpus-Setup.md |
| P2.8 | Tokenize full corpus into binary training format (shuffled, 2048-token sequences) | ✅ Done | 1.94B tokens across 5 sources (Wikipedia 1.64B + Gutenberg 147M + Chess games 153M + Year topics 2M + Chess books 1.8M); 49 shard files in /mnt/data/training_corpus/TinyLlama-1.1B/shards/ |
| P2.9 | Create train/validation split (99%/1%) | ✅ Done | 940,207 train seqs (1.93B tokens, 3.6 GB) + 9,497 val seqs (19.4M tokens, 37 MB); output: train.bin / val.bin in /mnt/data/training_corpus/TinyLlama-1.1B/ |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| P3.1 | Run CPT on SmolLM2-360M with pre-1969 corpus using scripts/train_deepred_model.py | 7–12 days | Full-weight BF16, gradient checkpointing; dev mode defaults to 5% data subset for fast validation; see DeepRedModel-Setup.md |
| P3.2 | Evaluate: loss curves, perplexity, sample generations, temporal compliance | 1 day | Key checkpoint — verify CPT actually suppresses post-1969 knowledge |
| P3.3 | Quick SFT test + GGUF export of the 360M model | 1 day | Verify end-to-end: does a CPT'd + SFT'd small model behave as expected? |
| P3.4 | Decide: proceed to 1.1B CPT, or adjust data/hyperparameters | — | If 360M shows good temporal separation, commit to the full TinyLlama run |
| Prod Phase 3 total | ~2 weeks |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| P4.1 | Download TinyLlama-1.1B base model from HuggingFace | 30 min | TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T; ~2.2 GB in FP16 |
| P4.2 | Run continued pre-training on full pre-1969 corpus (15–30B tokens, 5–10 epochs) | 3–5 weeks | Full-weight updates, BF16 mixed precision, gradient checkpointing; use hyperparameters validated in P3 |
| P4.3 | Monitor: validation loss, periodic sample generation, checkpoint every ~1B tokens | Ongoing | Watch for overfitting (val loss rising while train loss falls) |
| Prod Phase 4 total | 3–5 weeks |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| P5.1 | Generate full thematic ChatML dataset using scripts/generate_theme_dataset.py | 4–8 hours | Requires LM Studio running; uses all filtered Gutenberg chunks |
| P5.2 | SFT for Deep Red persona using LoRA (scripts/finetune_theme.py) | 1–2 days | LoRA rank 16–32, lr 2e-5, 2–3 epochs on ChatML data |
| P5.3 | Merge LoRA adapter using scripts/merge_lora.py | 30 min | |
| P5.4 | Convert to GGUF using scripts/convert_to_gguf.py | 30 min | Quantize to Q4_K_M or Q5_K_M |
| P5.5 | Deploy to LM Studio and evaluate | 1 day | Run scripts/evaluate_temporal.py + manual testing |
| Prod Phase 5 total | 2–3 days |
| Step | Task | Duration (est.) | Notes |
|---|---|---|---|
| P6.1 | Analyze failure modes (temporal leakage, persona drift, factual errors) | 1–2 days | Red-team testing, adversarial probing |
| P6.2 | Adjust data mix, hyperparameters, and retrain as needed | Ongoing | May loop back to Prod Phase 2 (data) or Phase 4 (training) |
| P6.3 | Consider distillation augmentation or extended CPT epochs if results need improvement | Ongoing | Teacher-filtered synthetic data (Approach alternative) or additional training passes |
| Prod Phase 6 total | Ongoing |