Head-level surgical training: per-head LR, isolated fine-tuning, hot-swap head groups #97

## The Insight

Pruning importance scores aren't just "what to remove." They're a **functional map** of the model — which heads do what. That map enables surgical training.
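
One way to picture that map is as a table of per-domain scores keyed by (layer, head). The sketch below is a minimal illustration in plain Python. The `ImportanceMap` alias, the `top_heads` helper, and every score in it are hypothetical; none of them come from an existing pruning tool.

```python
from typing import Dict, List, Tuple

HeadId = Tuple[int, int]  # (layer, head)
ImportanceMap = Dict[str, Dict[HeadId, float]]  # domain -> head -> score in [0, 1]


def top_heads(imap: ImportanceMap, domain: str, n: int) -> List[HeadId]:
    """Return the n heads with the highest importance score for a domain."""
    scores = imap[domain]
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Made-up scores, for illustration only.
importance: ImportanceMap = {
    "code": {(2, 3): 0.92, (8, 12): 0.10, (4, 0): 0.55},
    "tool-calling": {(2, 3): 0.15, (8, 12): 0.81, (4, 0): 0.40},
}

code_heads = top_heads(importance, "code", 2)
tool_heads = top_heads(importance, "tool-calling", 1)
print(code_heads, tool_heads)
```

The same head can rank high for one domain and low for another, which is what makes the scores usable as a map rather than a single keep/remove list.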

## Capabilities (Progressive)

### 1. Per-Head Learning Rates
Heads struggling with a capability get higher LR. Heads that are solid get frozen. Instead of uniform training across all parameters, focus compute where it's needed.

```
Head 12, Layer 8: low importance for code → high LR (needs to learn)
Head 3, Layer 2: high importance for code → frozen (already good)
```
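
As a rough sketch of how the scores could drive learning rates, the snippet below freezes heads above a threshold and scales the rest by `1 - importance`. That schedule, the `per_head_lr` and `sgd_step` names, and the 0.8 threshold are assumptions chosen for illustration, not a settled design.

```python
from typing import Dict, List, Tuple

HeadId = Tuple[int, int]  # (layer, head)


def per_head_lr(
    importance: Dict[HeadId, float], base_lr: float, freeze_above: float = 0.8
) -> Dict[HeadId, float]:
    """Map importance scores to learning rates (hypothetical schedule)."""
    lrs = {}
    for head, score in importance.items():
        if score >= freeze_above:
            lrs[head] = 0.0  # already good at this capability: frozen
        else:
            lrs[head] = base_lr * (1.0 - score)  # weaker head: larger step
    return lrs


def sgd_step(
    weights: Dict[HeadId, List[float]],
    grads: Dict[HeadId, List[float]],
    lrs: Dict[HeadId, float],
) -> Dict[HeadId, List[float]]:
    """Plain SGD where each head's parameters use that head's learning rate."""
    return {
        head: [w - lrs[head] * g for w, g in zip(weights[head], grads[head])]
        for head in weights
    }


importance = {(8, 12): 0.10, (2, 3): 0.92}
lrs = per_head_lr(importance, base_lr=1e-3)

weights = {(8, 12): [1.0, 1.0], (2, 3): [1.0, 1.0]}
grads = {(8, 12): [0.5, -0.5], (2, 3): [0.5, -0.5]}
updated = sgd_step(weights, grads, lrs)
print(lrs, updated)
```

In a real framework the heads of one projection usually share a single weight tensor, so per-head rates would need slicing or per-head gradient scaling. With adaptive optimizers such as Adam, scaling gradients is not equivalent to scaling the learning rate, so that detail needs care.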

### 2. Isolated Fine-Tuning
Extract the N heads most responsible for a capability. Fine-tune JUST those on targeted data. Plug them back in. The rest of the model's weights are untouched, which should limit catastrophic forgetting.

```
# Proposed API (pseudocode): select heads scoring above 0.8 for the "code" domain.
code_heads = extract_heads(model, domain="code", min_importance=0.8)
fine_tune(code_heads, dataset="advanced-rust-patterns")
model = replace_heads(model, code_heads)
# Goal: only Rust coding changes. All other weights stay identical.
```
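
The extract/replace round trip can be sketched on a toy weight matrix. This assumes head `h` owns the contiguous row block `[h * head_dim, (h + 1) * head_dim)` of a projection matrix; real layouts vary by architecture. The helper signatures here are simplified stand-ins for the proposed `extract_heads` / `replace_heads`, and the "+1.0" update is a placeholder for actual fine-tuning.

```python
from typing import Dict, List

Matrix = List[List[float]]


def head_rows(head: int, head_dim: int) -> range:
    """Row indices owned by one head (assumed contiguous layout)."""
    return range(head * head_dim, (head + 1) * head_dim)


def extract_heads(weight: Matrix, heads: List[int], head_dim: int) -> Dict[int, Matrix]:
    """Copy out the row blocks belonging to the selected heads."""
    return {h: [list(weight[r]) for r in head_rows(h, head_dim)] for h in heads}


def replace_heads(weight: Matrix, blocks: Dict[int, Matrix], head_dim: int) -> Matrix:
    """Return a copy of weight with the given head blocks written back."""
    patched = [list(row) for row in weight]
    for h, block in blocks.items():
        for offset, r in enumerate(head_rows(h, head_dim)):
            patched[r] = list(block[offset])
    return patched


head_dim, n_heads, d_model = 2, 4, 3
weight = [
    [float(r * d_model + c) for c in range(d_model)]
    for r in range(n_heads * head_dim)
]

blocks = extract_heads(weight, heads=[1], head_dim=head_dim)
# Placeholder for fine-tuning: nudge every value in the extracted block.
tuned = {h: [[v + 1.0 for v in row] for row in block] for h, block in blocks.items()}
patched = replace_heads(weight, tuned, head_dim)

unchanged = [r for r in range(len(weight)) if patched[r] == weight[r]]
print(unchanged)
```

Rows outside the selected head come back bit-identical. Note that identical weights elsewhere do not by themselves guarantee identical behaviour elsewhere, since later layers consume the changed heads' outputs, so the "nothing else moved" claim would still need a benchmark check.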

### 3. Head-Level LoRA
Instead of applying LoRA to ALL q/k/v/o projections, target only the heads a benchmark identified as weak. The result is far fewer trainable parameters, which should also mean faster convergence and more surgical updates.

```python
# Standard LoRA: 3.1M trainable params (all heads)
# Head-targeted LoRA: 200K params (only weak heads)
# benchmark_identify_weak and HeadTargetedLoRA are proposed APIs, not existing ones.
weak_heads = benchmark_identify_weak(model, benchmark="ToolCall15")
lora_config = HeadTargetedLoRA(target_heads=weak_heads, r=8)
```
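
A back-of-the-envelope count shows where the savings would come from. The sketch assumes square `d_model x d_model` projections (no grouped-query attention) and one rank-`r` adapter per targeted head per projection, whose head-side factor spans only `head_dim`. The function names and the model dimensions are hypothetical, and the output is not meant to reproduce the 3.1M / 200K figures above.

```python
def lora_params_full(d_model: int, r: int, n_layers: int, n_proj: int = 4) -> int:
    """Standard LoRA: A is (r x d_model) and B is (d_model x r) per projection."""
    return n_layers * n_proj * r * (d_model + d_model)


def lora_params_head_targeted(
    d_model: int, head_dim: int, r: int, n_target_heads: int, n_proj: int = 4
) -> int:
    """Per-head adapters: one factor spans d_model, the other only head_dim."""
    return n_target_heads * n_proj * r * (d_model + head_dim)


# Hypothetical model: 24 layers, d_model=1024, 16 heads of size 64, rank 8.
full = lora_params_full(d_model=1024, r=8, n_layers=24)
targeted = lora_params_head_targeted(d_model=1024, head_dim=64, r=8, n_target_heads=10)
print(full, targeted)
```

Under these assumptions each targeted head carries its own `d_model`-wide factor, so the saving only holds while the set of weak heads is small. Targeting most heads this way would cost more than standard LoRA.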

### 4. Hot-Swappable Head Groups
The logical endpoint: head groups as independently loadable modules. The "Fortran heads" are literally different tensors that page in/out at runtime. This IS the genome paging system — but at head granularity instead of full LoRA adapters.

```
model.load_head_group("code/rust", layers=[4,8,12,16])
model.load_head_group("tool-calling", layers=[2,6,10,14])
# Different heads active for different capabilities
```
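
A minimal in-memory sketch of the swap mechanics follows. `HeadSwappableModel`, `register_head_group`, and `unload_head_group` are hypothetical names, groups are keyed by (layer, head) rather than by the `layers=` argument shown above, and the sketch assumes loaded groups do not overlap.

```python
from typing import Dict, List, Tuple

HeadId = Tuple[int, int]  # (layer, head)
Block = List[List[float]]  # one head's weights


class HeadSwappableModel:
    """Toy model whose heads can be paged in and out by named group."""

    def __init__(self, heads: Dict[HeadId, Block]) -> None:
        self.heads = dict(heads)
        self._groups: Dict[str, Dict[HeadId, Block]] = {}
        self._displaced: Dict[str, Dict[HeadId, Block]] = {}

    def register_head_group(self, name: str, blocks: Dict[HeadId, Block]) -> None:
        """Make a group available for loading."""
        self._groups[name] = blocks

    def load_head_group(self, name: str) -> None:
        """Swap a group in, remembering the blocks it displaces."""
        if name in self._displaced:
            return  # already loaded
        group = self._groups[name]
        self._displaced[name] = {h: self.heads[h] for h in group}
        self.heads.update(group)

    def unload_head_group(self, name: str) -> None:
        """Restore the blocks that were active before the group was loaded."""
        self.heads.update(self._displaced.pop(name))


base = {(4, 0): [[0.0]], (2, 1): [[0.0]]}
model = HeadSwappableModel(base)
model.register_head_group("code/rust", {(4, 0): [[1.0]]})
model.register_head_group("tool-calling", {(2, 1): [[2.0]]})

model.load_head_group("code/rust")
loaded = dict(model.heads)
model.unload_head_group("code/rust")
restored = dict(model.heads)
print(loaded, restored)
```

Overlapping groups would need an explicit policy (for example, a load-order stack per head), which this sketch leaves out.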

## The Genome Connection

The pruning importance map IS the genome:
- High-importance heads = expressed genes (active for this task)
- Low-importance heads = dormant genes (available for other tasks)
- Fine-tuning a head group = epigenetic modification
- Hot-swapping head groups = gene expression switching

This isn't a metaphor. It's a literal functional map of which parameters do what, with the ability to independently modify regions.

## Dependencies
- #94 — Continuous defrag (structural foundation)
- #92 — Adapter registry (head groups as publishable units)
- #96 — Benchmark-driven forging (identifies which heads are weak)
- Genome paging system in continuum
