Skip to content

Non-Record: Learning Adapters on Random Linear Maps — Selective Freeze, Progressive Freeze, and 7 Architecture Variants (H100 + A40 Validated)#1301

Open
himanshudongre wants to merge 2 commits intoopenai:mainfrom
himanshudongre:nonrecord/selective-freeze-random-linear-maps
Open

Non-Record: Learning Adapters on Random Linear Maps — Selective Freeze, Progressive Freeze, and 7 Architecture Variants (H100 + A40 Validated)#1301
himanshudongre wants to merge 2 commits intoopenai:mainfrom
himanshudongre:nonrecord/selective-freeze-random-linear-maps

Conversation

@himanshudongre
Copy link
Copy Markdown

@himanshudongre himanshudongre commented Apr 3, 2026

Summary

First systematic investigation of random linear maps for Parameter Golf, directly addressing the Requests for PRs item "Learning adapters on random linear maps." This work evaluates 7 architecture variants across 3 hardware configurations (H100, A40, M4) with FineWeb sp1024/sp4096 validation, totaling ~25 experiments at ~$45 self-funded compute.

Core finding: Selectively freezing MLP gate+up projections as deterministic random (from seeds, 0 bytes in artifact) enables fitting larger models in 16MB. A 12L frozen model beats a 6L fully-trained model by 11.5% on FineWeb. Progressive freeze (train fully, then freeze mid-training) outperforms random-init freeze by 1.3 percentage points on FineWeb sp4096.

Checks off "Learning adapters on random linear maps" from Requests for PRs.


1. The Idea

In Parameter Golf, the artifact budget (16MB) limits model size. But what if some weights cost 0 bytes?

Selective Freeze: Replace MLP gate+up projections with deterministic random matrices generated from per-layer seeds. At eval time, regenerate from seeds — zero artifact cost. Only attention weights + MLP down projections are learned and stored.

class FrozenFC(CastedLinear):
    def __init__(self, in_features, out_features, seed):
        super().__init__(in_features, out_features, bias=False)
        rng = torch.Generator(); rng.manual_seed(seed)
        with torch.no_grad():
            self.weight.copy_(torch.randn(out_features, in_features, generator=rng) / math.sqrt(in_features))
        self.weight.requires_grad = False
    def _save_to_state_dict(self, dest, prefix, keep_vars):
        pass  # Not saved — regenerated from seed
    def _load_from_state_dict(self, sd, prefix, meta, strict, missing, unexpected, errors):
        pass  # Not loaded — regenerated from seed

This is conceptually related to VeRA (Kopiczko et al., 2023), Extreme Learning Machines (Huang et al., 2006), and the Johnson-Lindenstrauss lemma — random projections preserve geometric structure.


2. Selective Freeze: Which Layers to Freeze?

I compared four freezing strategies on H100 with FineWeb sp1024 (3000 steps):

Config Layers Dim Frozen % CE vs Baseline Artifact
Baseline (fully trained) 6L 192d 0% 3.2531 2.4MB
Full freeze + LoRA r16 6L 192d 94% ~80% gap
Selective freeze gate+up 6L 192d 37% -2.1% 1.5MB
Selective + dropout 0.2 6L 192d 37% 3.4404 +5.8% 1.5MB
Selective freeze 8L 256d 37% 3.1427 -3.4% 3.3MB
Selective + dropout 12L 12L 384d 37% 2.8803 -11.5% 7.3MB
Fully trained 12L (no freeze) 12L 384d 0% 2.7295 -16.1% 17.7MB ❌

Key insight: The fully-trained 12L model achieves the best CE (2.7295) but needs 17.7MB — over the 16MB limit. Selective freeze enables a 12L model at 7.3MB that beats the smaller 6L baseline by 11.5%. The frozen weights act as a structural regularizer AND enable fitting more parameters per artifact byte.

Full freeze + LoRA fails (80% gap) because LoRA rank-16 cannot compensate for freezing ALL weights. Selective freeze (gate+up only, 37%) leaves attention and MLP down projection learnable — a much better tradeoff.


3. Progressive Freeze: Train First, Then Freeze

Random-init freeze has a weakness: the frozen weights are random, not trained. Progressive freeze addresses this:

  1. Train all weights normally for N steps (Phase 1)
  2. Freeze MLP gate+up projections (Phase 2)
  3. Continue training the remaining weights (Phase 3)

The frozen weights now contain trained features (not random), and the subsequent training adapts the rest of the network around them. This combines the regularization benefit of freezing with the quality of learned features.

A40, FineWeb sp4096 (3000 steps total, freeze at step 1000):

Config CE vs Baseline Delta vs Selective
Baseline 6L 192d 4.2132
Selective freeze 8L 256d (random init) 4.1767 -0.9%
Progressive freeze 8L 256d 4.1189 -2.2% +1.3pp better
Progressive freeze 12L 384d 3.8370 -8.9% +8.0pp better

Progressive freeze consistently outperforms random-init selective freeze at the same model size. The 12L 384d progressive freeze result (-8.9%) is the strongest finding in this work.

A40, Gutenberg validation (3000 steps):

Config CE vs Baseline
Baseline 6L 192d 1.2957
Direct freeze 8L 256d 1.2757 -1.5%
Progressive freeze 8L 256d 1.3049 +0.7%
Progressive+distill 8L 256d 1.3013 +0.4%

Note: Gutenberg and FineWeb results diverge — progressive freeze wins on FineWeb but not Gutenberg. This is consistent with the scale deception phenomenon documented in my PR #1259.


4. Frozen + Low-Rank Correction

Can a full frozen MLP be corrected with a learned low-rank term in parallel? output = frozen_mlp(x) + A @ B @ x where A, B are learned.

M4, Gutenberg (3000 steps, 6L 192d unless noted):

Rank CE vs Baseline
r=32 1.4310 +10.0%
r=64 1.4075 +8.2%
r=128 1.3823 +6.2%
12L 384d, r=64 1.3041 +0.23%

Low-rank correction converges toward the baseline as rank increases, and at 12L 384d nearly matches it (+0.23%). This validates that larger frozen architectures with small learned corrections can approach fully-trained quality — the tradeoff is extra compute for frozen layers vs artifact savings.


5. Self-Distillation + Freeze

Train a teacher, then distill knowledge to a larger freeze student:

Config CE vs Baseline
Teacher 6L 192d → Student 8L 256d freeze (1500+1500 steps) 1.3451 +3.8%

Cross-architecture distillation hurts because the teacher (dim=192) and student (dim=256) have different representation spaces. The student doesn't benefit from the teacher's knowledge when architectures differ significantly.


6. Progressive Freeze + Self-Distillation Combo

Config CE vs Baseline
1000 train + 1000 self-distill + 1000 frozen 1.3013 +0.4%

The self-distillation phase provides marginal benefit before the freeze. Progressive freeze alone (-2.2% on FineWeb) is simpler and more effective.


7. Dual Model Ensemble

Two smaller models in one 16MB artifact, average logits at eval:

Config BPC
Single 6L 192d 1.9797
Ensemble (2×3L 128d) 1.7660

Ensemble helps (+10.8%) but both individual models are weak. The artifact budget is better spent on one larger model with frozen weights than two small models.


8. Key Insights

  1. Selective freeze gate+up is the sweet spot — 37% frozen, leaving attention fully learnable. Full freeze + LoRA (94% frozen) catastrophically fails.

  2. Progressive freeze > random-init freeze — trained features before freezing give +1.3pp over random init on FineWeb sp4096. The frozen weights serve as regularization, not random projections.

  3. Bigger frozen > smaller learned — 12L 384d with 37% frozen (7.3MB artifact) beats 6L 192d fully trained (2.4MB artifact) by 11.5%. The artifact-per-BPB efficiency favors larger frozen architectures.

  4. Low-rank correction converges at scale — frozen+correction at 12L 384d nearly matches baseline (+0.23%), suggesting the frozen MLP acts as a good initialization that small corrections can refine.

  5. Scale matters critically — progressive freeze wins on FineWeb but loses on Gutenberg. See my PR Non-record: KNN Hidden State Retrieval — Scale Deception from Weak to Strong Models (8xH100) #1259 for analysis of why local results can be misleading.

  6. Cross-architecture distillation fails — teacher and student need compatible representation spaces.


9. Artifact Size Analysis

Config Total Params Learned Params Artifact (int6 est.)
Standard 11L 512d 34.4M 34.4M ~15.9MB
Selective 11L 512d 34.4M 21.7M ~10.0MB
Selective 13L 512d 40.2M 26.5M ~12.2MB

Selective freeze saves ~37% artifact space, enabling 2 extra layers within the 16MB budget. Combined with progressive freeze for quality, this is a viable path to higher-capacity models.


10. Implementation

FrozenFC class: Extends CastedLinear, overrides _save_to_state_dict and _load_from_state_dict to exclude frozen weights from serialization. Weights regenerated from seed at __init__.

Progressive freeze: Set PROGRESSIVE_FREEZE_FRAC=0.3 to freeze MLP fc weights after 30% of training steps.

torch.compile compatibility: FrozenFC requires fullgraph=False due to different computation graph from CastedLinear. This incurs a ~15% throughput penalty — an important consideration for wallclock-limited competition.

See companion code: selective_freeze_patch.py, record_train_gpt.py


Reproduction

# Selective freeze on FineWeb sp1024 (1×H100):
SELECTIVE_FREEZE=1 NUM_LAYERS=12 MODEL_DIM=384 \
torchrun --nproc_per_node=1 train_gpt.py

# Progressive freeze on FineWeb sp4096 (1×A40):
PROGRESSIVE_FREEZE_FRAC=0.3 NUM_LAYERS=8 MODEL_DIM=256 \
python3 exp_a40_apr4.py

Related Work

Attribution

Builds on Clark's train_gpt.py (PR #1218), competition baseline architecture, and FineWeb sp1024/sp4096 datasets. All experiments self-funded (~$45 compute across H100, A40, M4).

…e+Up Beats Full Freeze + LoRA

First FineWeb-validated implementation of OpenAI wishlist item. Selective freeze (37%) outperforms full freeze + LoRA (94%) by 40×. Larger frozen model beats smaller learned model by 11.5% at same artifact budget.
…alidation

- Progressive freeze: train fully then freeze mid-training (-2.2% on FineWeb sp4096, -8.9% at 12L 384d)
- Frozen + low-rank correction: approaches baseline at scale (+0.23% at 12L 384d)
- Self-distillation + freeze: cross-architecture distillation hurts (+3.8%)
- Progressive + self-distill combo: marginal benefit (+0.4%)
- Dual model ensemble: individually weak, artifact better spent on one large model
- A40 FineWeb sp4096 validation (exp_a40_apr4.py)
- Overnight architecture search results (exp_overnight_apr4.py)
- Key finding: progressive freeze > random-init freeze by 1.3 percentage points

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@himanshudongre himanshudongre changed the title Non-record: Selective Freeze on Random Linear Maps — Why Freezing Gate+Up Beats Full Freeze + LoRA Non-Record: Learning Adapters on Random Linear Maps — Selective Freeze, Progressive Freeze, and 7 Architecture Variants (H100 + A40 Validated) Apr 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant