Head-level surgical training — per-head LR, isolated fine-tuning, hot-swap head groups #97

@joelteply

The Insight

Pruning importance scores aren't just "what to remove." They're a functional map of the model — which heads do what. That map enables surgical training.

Capabilities (Progressive)

1. Per-Head Learning Rates

Heads struggling with a capability get higher LR. Heads that are solid get frozen. Instead of uniform training across all parameters, focus compute where it's needed.

Head 12, Layer 8: low importance for code → high LR (needs to learn)
Head 3, Layer 2: high importance for code → frozen (already good)
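A minimal sketch of how importance scores could be turned into per-head learning-rate multipliers. The function names, the 0.8 freeze threshold, the linear scaling rule, and the assumption that a fused projection stores its rows head by head are all illustrative and not part of this proposal. For plain SGD, scaling a head's gradient rows is equivalent to scaling its LR; adaptive optimizers such as Adam normalize gradient magnitude, so there the scale would need to be applied to the update instead.

```python
def head_lr_multipliers(importance, freeze_above=0.8, max_mult=4.0):
    """Map per-head importance in [0, 1] to an LR multiplier.

    Heads at or above `freeze_above` get 0.0 (frozen). Other heads scale
    linearly from `max_mult` at importance 0 down to 1.0 at the threshold.
    """
    mults = {}
    for head, score in importance.items():
        if score >= freeze_above:
            mults[head] = 0.0
        else:
            mults[head] = max_mult - (max_mult - 1.0) * (score / freeze_above)
    return mults


def row_scales(mults, layer, num_heads, head_dim):
    """Expand multipliers to one scale per row of a fused projection.

    Assumes rows are laid out head by head; unlisted heads keep scale 1.0.
    """
    scales = []
    for h in range(num_heads):
        scales.extend([mults.get((layer, h), 1.0)] * head_dim)
    return scales


# Keys are (layer, head); scores are hypothetical importance values for "code".
importance = {(8, 12): 0.1, (2, 3): 0.95}
mults = head_lr_multipliers(importance)
layer2_scales = row_scales(mults, layer=2, num_heads=4, head_dim=2)
```

Here head 3 of layer 2 is frozen (its rows get scale 0.0), while head 12 of layer 8 gets a multiplier well above 1.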

2. Isolated Fine-Tuning

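Extract the N heads most responsible for a capability. Fine-tune JUST those on targeted data. Plug them back in. Every other weight in the model is untouched, which should sharply limit catastrophic forgetting. The swapped heads still write into the shared residual stream, so other capabilities need to be re-benchmarked after the swap.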

# Proposed API, not yet implemented. `min_importance` is a placeholder parameter name.
code_heads = extract_heads(model, domain="code", min_importance=0.8)
fine_tune(code_heads, dataset="advanced-rust-patterns")
model = replace_heads(model, code_heads)
# Goal: Rust coding improves. All weights outside code_heads are identical.
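To make the extract and replace steps concrete, here is a toy sketch on a single projection matrix stored as a list of rows. It assumes each head owns a contiguous block of `head_dim` rows; the helper names and the "+0.5" stand-in for fine-tuning are hypothetical. It only shows that rows outside the selected heads are left bit-for-bit unchanged, not that model behaviour elsewhere is preserved.

```python
def head_rows(head, head_dim):
    """Row indices owned by `head` in a head-by-head row layout."""
    return range(head * head_dim, (head + 1) * head_dim)


def extract_heads(weight, heads, head_dim):
    """Copy out the row blocks belonging to the given heads."""
    return {h: [list(weight[r]) for r in head_rows(h, head_dim)] for h in heads}


def replace_heads(weight, head_slices, head_dim):
    """Return a new matrix with the given head blocks swapped in."""
    new_weight = [list(row) for row in weight]
    for h, rows in head_slices.items():
        for offset, r in enumerate(head_rows(h, head_dim)):
            new_weight[r] = list(rows[offset])
    return new_weight


# Toy projection: 3 heads, head_dim 2, input width 2.
weight = [[float(r), float(r)] for r in range(6)]
slices = extract_heads(weight, heads=[1], head_dim=2)

# Stand-in for fine-tuning: nudge every extracted value.
tuned = {h: [[v + 0.5 for v in row] for row in rows] for h, rows in slices.items()}

patched = replace_heads(weight, tuned, head_dim=2)
untouched = [r for r in range(6) if patched[r] == weight[r]]
```

Only rows 2 and 3 (head 1) differ between `weight` and `patched`.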

3. Head-Level LoRA

Instead of LoRA on ALL q/k/v/o projections, target only the heads a benchmark identified as weak. This trains far fewer parameters (illustrative figures below) and is expected to converge faster, since updates go only where the benchmark shows a deficit.

# Standard LoRA: 3.1M trainable params (all heads)
# Head-targeted LoRA: 200K params (only weak heads)
weak_heads = benchmark_identify_weak(model, ToolCall15)
lora_config = HeadTargetedLoRA(target_heads=weak_heads, r=8)
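A rough way to estimate the parameter savings, as a sketch. A rank-r LoRA adapter on a d_in x d_out projection adds r * (d_in + d_out) parameters. The model dimensions below are made up for illustration and are not meant to reproduce the 3.1M and 200K figures above. The sketch also assumes standard multi-head attention (no grouped-query sharing of k/v heads) and that a head-targeted adapter covers only that head's `head_dim` slice of each projection.

```python
def lora_params_full(d_model, num_layers, r, num_projections=4):
    """Trainable params for rank-r LoRA on every q/k/v/o projection."""
    per_projection = r * (d_model + d_model)
    return per_projection * num_projections * num_layers


def lora_params_head_targeted(d_model, head_dim, num_target_heads, r,
                              num_projections=4):
    """Trainable params when each adapter covers one head's slice only.

    For q/k/v the slice is d_model -> head_dim; for o it is
    head_dim -> d_model. Both cost r * (d_model + head_dim).
    """
    per_head_projection = r * (d_model + head_dim)
    return per_head_projection * num_projections * num_target_heads


# Hypothetical model: d_model=1024, 16 heads of dim 64, 24 layers, rank 8.
full = lora_params_full(d_model=1024, num_layers=24, r=8)
targeted = lora_params_head_targeted(d_model=1024, head_dim=64,
                                     num_target_heads=10, r=8)
print(full, targeted)  # 1572864 348160
```

With these assumed dimensions, targeting 10 weak heads trains about 22% of the parameters of full LoRA. The ratio depends on how many heads the benchmark flags.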

4. Hot-Swappable Head Groups

The logical endpoint: head groups as independently loadable modules. The "Fortran heads" are literally different tensors that page in/out at runtime. This IS the genome paging system — but at head granularity instead of full LoRA adapters.

model.load_head_group("code/rust", layers=[4,8,12,16])
model.load_head_group("tool-calling", layers=[2,6,10,14])
# Different heads active for different capabilities
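One possible shape for the paging side, sketched as a small least-recently-used cache. `HeadGroupCache`, its `capacity`, and the `loader` callable are hypothetical names for illustration. A real loader would read head tensors from disk and patch them into the model, which this sketch does not attempt.

```python
from collections import OrderedDict


class HeadGroupCache:
    """Keep at most `capacity` head groups resident, evicting the LRU one."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # callable: group name -> head tensors
        self.resident = OrderedDict()

    def load_head_group(self, name):
        if name in self.resident:
            # Cache hit: mark as most recently used, no reload.
            self.resident.move_to_end(name)
        else:
            if len(self.resident) >= self.capacity:
                # Page out the least recently used group.
                self.resident.popitem(last=False)
            self.resident[name] = self.loader(name)
        return self.resident[name]


loads = []


def fake_loader(name):
    """Stand-in loader that records each page-in."""
    loads.append(name)
    return {"name": name, "weights": [0.0] * 4}


cache = HeadGroupCache(capacity=2, loader=fake_loader)
cache.load_head_group("code/rust")
cache.load_head_group("tool-calling")
cache.load_head_group("code/rust")      # hit, nothing is reloaded
cache.load_head_group("code/fortran")   # evicts "tool-calling"
resident_names = list(cache.resident)
```

An LRU policy is only one option. Pinning frequently used groups or prefetching by task type would fit the same interface.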

The Genome Connection

The pruning importance map IS the genome:

  • High-importance heads = expressed genes (active for this task)
  • Low-importance heads = dormant genes (available for other tasks)
  • Fine-tuning a head group = epigenetic modification
  • Hot-swapping head groups = gene expression switching

This isn't a metaphor. It's a literal functional map of which parameters do what, with the ability to independently modify regions.

Dependencies
