Head-level surgical training — per-head LR, isolated fine-tuning, hot-swap head groups #97
Description
The Insight
Pruning importance scores aren't just "what to remove." They're a functional map of the model — which heads do what. That map enables surgical training.
Capabilities (Progressive)
1. Per-Head Learning Rates
Heads struggling with a capability get higher LR. Heads that are solid get frozen. Instead of uniform training across all parameters, focus compute where it's needed.
- Head 12, Layer 8: low importance for code → high LR (needs to learn)
- Head 3, Layer 2: high importance for code → frozen (already good)
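A minimal sketch of the idea, in plain Python: map a pruning importance score per head to a learning rate, freezing heads that are already strong for the capability. The names (`per_head_lr`, `importance`, `freeze_above`) are illustrative, not an existing API.

```python
# Hypothetical sketch: derive per-head learning rates from a pruning
# importance map. Keys are (layer, head); scores are importance in [0, 1].

def per_head_lr(importance, base_lr=1e-4, freeze_above=0.8):
    """Low-importance heads (weak at the target capability) train at a
    high rate; high-importance heads are frozen (lr = 0)."""
    lrs = {}
    for head, score in importance.items():
        if score >= freeze_above:
            lrs[head] = 0.0                    # already good: freeze
        else:
            lrs[head] = base_lr * (1.0 - score)  # weaker head, higher LR
    return lrs

importance = {(8, 12): 0.1, (2, 3): 0.9}  # (layer, head) -> importance
print(per_head_lr(importance))
```

In a real trainer these rates would become per-parameter-group options in the optimizer (e.g. one group per head's q/k/v/o slices), rather than a dict.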
2. Isolated Fine-Tuning
Extract the N heads most responsible for a capability. Fine-tune JUST those on targeted data. Plug them back in. The rest of the model is untouched — no catastrophic forgetting.
```python
code_heads = extract_heads(model, importance > 0.8, domain="code")
fine_tune(code_heads, dataset="advanced-rust-patterns")
model = replace_heads(model, code_heads)
# Only Rust coding improved. Everything else identical.
```
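The extract/replace pair above can be sketched with a toy model where each layer's attention is a list of per-head weight blocks. Everything here (`extract_heads`, `replace_heads`, the dict-of-lists model) is a hypothetical stand-in for the issue's API, to make the "only the extracted heads change" property concrete.

```python
import copy

def extract_heads(model, importance, threshold=0.8):
    """Copy out heads whose importance for the domain exceeds threshold."""
    return {
        (layer, h): copy.deepcopy(model[layer][h])
        for (layer, h), score in importance.items()
        if score > threshold
    }

def replace_heads(model, trained):
    """Write fine-tuned head weights back; all other heads are untouched."""
    for (layer, h), weights in trained.items():
        model[layer][h] = weights
    return model

model = {0: [[1.0], [2.0]], 1: [[3.0], [4.0]]}   # layer -> per-head blocks
importance = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.85, (1, 1): 0.1}

code_heads = extract_heads(model, importance)
for k in code_heads:                  # stand-in for fine_tune(code_heads, ...)
    code_heads[k] = [w + 0.5 for w in code_heads[k]]
model = replace_heads(model, code_heads)
print(model)  # only heads (0, 0) and (1, 0) changed
```

Because the untouched heads are never in the optimizer's reach, there is nothing for catastrophic forgetting to corrupt outside the extracted group.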
3. Head-Level LoRA
Instead of LoRA on ALL q/k/v/o projections, target only the heads a benchmark identified as weak. Dramatically fewer trainable parameters, faster convergence, surgical precision.
```python
# Standard LoRA: 3.1M trainable params (all heads)
# Head-targeted LoRA: 200K params (only weak heads)
weak_heads = benchmark_identify_weak(model, ToolCall15)
lora_config = HeadTargetedLoRA(target_heads=weak_heads, r=8)
```
4. Hot-Swappable Head Groups
The logical endpoint: head groups as independently loadable modules. The "Fortran heads" are literally different tensors that page in/out at runtime. This IS the genome paging system — but at head granularity instead of full LoRA adapters.
```python
model.load_head_group("code/rust", layers=[4, 8, 12, 16])
model.load_head_group("tool-calling", layers=[2, 6, 10, 14])
# Different heads active for different capabilities
```
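A minimal sketch of what `load_head_group` could look like, assuming head groups live in a registry as named bundles of per-head weights and get paged into the live model on demand. `HeadGroupStore`, `publish`, and the string weights are all illustrative assumptions, not an existing implementation.

```python
class HeadGroupStore:
    """Registry of named head groups: name -> {(layer, head): weights}."""
    def __init__(self):
        self.groups = {}

    def publish(self, name, weights):
        self.groups[name] = weights

    def load(self, name):
        return self.groups[name]

class Model:
    def __init__(self, store):
        self.store = store
        self.active = {}   # (layer, head) -> weights currently paged in

    def load_head_group(self, name, layers):
        """Page in a head group's tensors, restricted to the given layers."""
        for (layer, head), w in self.store.load(name).items():
            if layer in layers:
                self.active[(layer, head)] = w

store = HeadGroupStore()
store.publish("code/rust", {(4, 0): "rust-w0", (8, 1): "rust-w1"})
store.publish("tool-calling", {(2, 0): "tool-w0"})

m = Model(store)
m.load_head_group("code/rust", layers=[4, 8])
m.load_head_group("tool-calling", layers=[2])
print(sorted(m.active))
```

The same store interface is what would let head groups be published and fetched like adapters in the registry dependency below.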
The Genome Connection
The pruning importance map IS the genome:
- High-importance heads = expressed genes (active for this task)
- Low-importance heads = dormant genes (available for other tasks)
- Fine-tuning a head group = epigenetic modification
- Hot-swapping head groups = gene expression switching
This isn't a metaphor. It's a literal functional map of which parameters do what, with the ability to independently modify regions.
Dependencies
- Structural pruning defrag — mask first, compact later (like disk defrag) #94 — Continuous defrag (structural foundation)
- Adapter registry — semantic search, auto-forge, publish (npm for intelligence) #92 — Adapter registry (head groups as publishable units)
- Benchmark-driven forging — use third-party benchmarks as both curriculum and proof #96 — Benchmark-driven forging (identifies which heads are weak)
- Genome paging system in continuum