Non-Record: Learning Adapters on Random Linear Maps — Selective Freeze, Progressive Freeze, and 7 Architecture Variants (H100 + A40 Validated)#1301
Open
himanshudongre wants to merge 2 commits into openai:main from
Conversation
…e+Up Beats Full Freeze + LoRA

First FineWeb-validated implementation of the OpenAI wishlist item. Selective freeze (37% frozen) outperforms full freeze + LoRA (94% frozen) by 40×. A larger frozen model beats a smaller fully-learned model by 11.5% at the same artifact budget.
…alidation

- Progressive freeze: train fully, then freeze mid-training (-2.2% on FineWeb sp4096, -8.9% at 12L 384d)
- Frozen + low-rank correction: approaches baseline at scale (+0.23% at 12L 384d)
- Self-distillation + freeze: cross-architecture distillation hurts (+3.8%)
- Progressive + self-distill combo: marginal benefit (+0.4%)
- Dual model ensemble: individually weak; artifact better spent on one large model
- A40 FineWeb sp4096 validation (exp_a40_apr4.py)
- Overnight architecture search results (exp_overnight_apr4.py)
- Key finding: progressive freeze > random-init freeze by 1.3 percentage points

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
First systematic investigation of random linear maps for Parameter Golf, directly addressing the Requests for PRs item "Learning adapters on random linear maps." This work evaluates 7 architecture variants across 3 hardware configurations (H100, A40, M4) with FineWeb sp1024/sp4096 validation, totaling ~25 experiments at ~$45 self-funded compute.
Core finding: Selectively freezing MLP gate+up projections as deterministic random matrices (regenerated from per-layer seeds, so they cost 0 bytes in the artifact) enables fitting larger models in 16MB. A 12L frozen model beats a 6L fully-trained model by 11.5% on FineWeb. Progressive freeze (train fully, then freeze mid-training) outperforms random-init freeze by 1.3 percentage points on FineWeb sp4096.
Checks off "Learning adapters on random linear maps" from Requests for PRs.
1. The Idea
In Parameter Golf, the artifact budget (16MB) limits model size. But what if some weights cost 0 bytes?
Selective Freeze: Replace MLP gate+up projections with deterministic random matrices generated from per-layer seeds. At eval time, regenerate from seeds — zero artifact cost. Only attention weights + MLP down projections are learned and stored.
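A minimal PyTorch sketch of the idea (illustrative only, not the PR's FrozenFC; the class name and the 1/√fan-in scaling are assumptions): the weight is regenerated from a seed and held in a non-persistent buffer, so it never appears in the checkpoint.

```python
import torch
import torch.nn as nn

class SeedFrozenLinear(nn.Module):
    """Linear map with a deterministic random weight regenerated from a seed.
    Never trained, never serialized: zero bytes in the artifact."""

    def __init__(self, in_features: int, out_features: int, seed: int):
        super().__init__()
        gen = torch.Generator().manual_seed(seed)
        w = torch.randn(out_features, in_features, generator=gen) * in_features ** -0.5
        # non-persistent buffer: moves with the module but is excluded from state_dict
        self.register_buffer("weight", w, persistent=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.t()
```

At eval time, constructing the layer with the same per-layer seed reproduces the identical weight, which is what makes the "regenerate from seeds" step in the PR work.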
This is conceptually related to VeRA (Kopiczko et al., 2023), Extreme Learning Machines (Huang et al., 2006), and the Johnson-Lindenstrauss lemma — random projections preserve geometric structure.
2. Selective Freeze: Which Layers to Freeze?
I compared four freezing strategies on H100 with FineWeb sp1024 (3000 steps):
Key insight: The fully-trained 12L model achieves the best CE (2.7295) but needs 17.7MB — over the 16MB limit. Selective freeze enables a 12L model at 7.3MB that beats the smaller 6L baseline by 11.5%. The frozen weights act as a structural regularizer AND enable fitting more parameters per artifact byte.
Full freeze + LoRA fails (80% gap) because LoRA rank-16 cannot compensate for freezing ALL weights. Selective freeze (gate+up only, 37%) leaves attention and MLP down projection learnable — a much better tradeoff.
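The capacity mismatch is easy to see with toy arithmetic under assumed dimensions (d=384 with a 4× MLP expansion; not necessarily the PR's exact config):

```python
# Why a rank-16 LoRA adapter cannot stand in for a frozen projection:
# it carries only a few percent of that matrix's parameters.
d, h, r = 384, 4 * 384, 16
full = d * h                  # one frozen projection's parameter count
lora = r * (d + h)            # its rank-16 adapter: A (r x d) + B (h x r)
print(f"adapter is {lora / full:.1%} of the frozen matrix")  # ~5.2%
```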
3. Progressive Freeze: Train First, Then Freeze
Random-init freeze has a weakness: the frozen weights are random, not trained. Progressive freeze addresses this:
The frozen weights now contain trained features (not random), and the subsequent training adapts the rest of the network around them. This combines the regularization benefit of freezing with the quality of learned features.
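A sketch of the mid-training switch, under assumed parameter names (`gate_proj`/`up_proj`) and loop shape; the PR's actual knob is `PROGRESSIVE_FREEZE_FRAC` in the training script:

```python
import torch
import torch.nn as nn

PROGRESSIVE_FREEZE_FRAC = 0.3  # freeze after 30% of training steps

def maybe_freeze_gate_up(model: nn.Module, step: int, total_steps: int) -> None:
    """Train everything at first; at the freeze step, stop updating the MLP
    gate/up projections in place and keep training the rest of the network."""
    if step != int(PROGRESSIVE_FREEZE_FRAC * total_steps):
        return
    for name, param in model.named_parameters():
        # assumes gate/up projections are named like "gate_proj" / "up_proj"
        if "gate_proj" in name or "up_proj" in name:
            param.requires_grad_(False)
            param.grad = None  # drop any pending gradient

# tiny demo block with the assumed naming
class TinyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate_proj = nn.Linear(4, 8)
        self.up_proj = nn.Linear(4, 8)
        self.down_proj = nn.Linear(8, 4)

model = TinyBlock()
maybe_freeze_gate_up(model, step=300, total_steps=1000)  # 300 == 0.3 * 1000
```

Optimizers like AdamW simply skip parameters whose gradient stays `None`, though rebuilding the optimizer over the remaining trainable parameters avoids carrying dead state.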
A40, FineWeb sp4096 (3000 steps total, freeze at step 1000):
Progressive freeze consistently outperforms random-init selective freeze at the same model size. The 12L 384d progressive freeze result (-8.9%) is the strongest finding in this work.
A40, Gutenberg validation (3000 steps):
Note: Gutenberg and FineWeb results diverge — progressive freeze wins on FineWeb but not Gutenberg. This is consistent with the scale deception phenomenon documented in my PR #1259.
4. Frozen + Low-Rank Correction
Can a full frozen MLP be corrected with a learned low-rank term in parallel?
output = frozen_mlp(x) + A @ B @ x, where A and B are learned.

M4, Gutenberg (3000 steps, 6L 192d unless noted):
Low-rank correction converges toward the baseline as rank increases, and at 12L 384d nearly matches it (+0.23%). This validates that larger frozen architectures with small learned corrections can approach fully-trained quality — the tradeoff is extra compute for frozen layers vs artifact savings.
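A sketch of the parallel correction (shapes, the two-layer ReLU MLP, and the zero-init of B are illustrative assumptions; the zero-init is a LoRA-style choice so training starts from the pure frozen-MLP output):

```python
import torch
import torch.nn as nn

class FrozenMLPWithLowRank(nn.Module):
    """Frozen random MLP plus a learned rank-r parallel correction:
    y = frozen_mlp(x) + B @ (A @ x)."""

    def __init__(self, dim: int, hidden: int, rank: int, seed: int):
        super().__init__()
        gen = torch.Generator().manual_seed(seed)
        # frozen two-layer MLP, regenerated from the seed; never serialized
        self.register_buffer(
            "w1", torch.randn(hidden, dim, generator=gen) * dim ** -0.5,
            persistent=False)
        self.register_buffer(
            "w2", torch.randn(dim, hidden, generator=gen) * hidden ** -0.5,
            persistent=False)
        # learned low-rank correction; only A and B land in the artifact
        self.A = nn.Parameter(torch.randn(rank, dim) * dim ** -0.5)
        self.B = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen = torch.relu(x @ self.w1.t()) @ self.w2.t()
        correction = (x @ self.A.t()) @ self.B.t()
        return frozen + correction
```

Only A and B (rank × dim each) are stored, which is the tradeoff the section describes: extra compute through the frozen path in exchange for artifact savings.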
5. Self-Distillation + Freeze
Train a teacher, then distill its knowledge into a larger frozen student:
Cross-architecture distillation hurts because the teacher (dim=192) and student (dim=256) have different representation spaces. The student doesn't benefit from the teacher's knowledge when architectures differ significantly.
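For reference, the standard logit-space distillation objective looks like the sketch below (the alpha and temperature values are assumptions, not the PR's settings). Since both models share the vocabulary, the loss is well-defined across architectures; the failure the PR observes is in what the dim=192 teacher's soft targets teach a dim=256 student, not in the loss plumbing.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    """Blend cross-entropy on data with KL to the teacher's softened logits."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to be temperature-invariant
    return (1 - alpha) * ce + alpha * kl
```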
6. Progressive Freeze + Self-Distillation Combo
The self-distillation phase provides marginal benefit before the freeze. Progressive freeze alone (-2.2% on FineWeb) is simpler and more effective.
7. Dual Model Ensemble
Two smaller models in one 16MB artifact, average logits at eval:
Ensemble helps (+10.8%) but both individual models are weak. The artifact budget is better spent on one larger model with frozen weights than two small models.
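The eval-time combination is plain logit averaging; a minimal sketch (function name assumed):

```python
import torch

@torch.no_grad()
def ensemble_logits(models, x):
    # average raw logits across ensemble members at eval time
    return torch.stack([m(x) for m in models]).mean(dim=0)
```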
8. Key Insights
Selective freeze gate+up is the sweet spot — 37% frozen, leaving attention fully learnable. Full freeze + LoRA (94% frozen) catastrophically fails.
Progressive freeze > random-init freeze — trained features before freezing give +1.3pp over random init on FineWeb sp4096. The frozen weights serve as regularization, not random projections.
Bigger frozen > smaller learned — 12L 384d with 37% frozen (7.3MB artifact) beats 6L 192d fully trained (2.4MB artifact) by 11.5%. The artifact-per-BPB efficiency favors larger frozen architectures.
Low-rank correction converges at scale — frozen+correction at 12L 384d nearly matches baseline (+0.23%), suggesting the frozen MLP acts as a good initialization that small corrections can refine.
Scale matters critically — progressive freeze wins on FineWeb but loses on Gutenberg. See PR #1259 ("Non-record: KNN Hidden State Retrieval — Scale Deception from Weak to Strong Models (8xH100)") for analysis of why local results can be misleading.
Cross-architecture distillation fails — teacher and student need compatible representation spaces.
9. Artifact Size Analysis
Selective freeze saves ~37% artifact space, enabling 2 extra layers within the 16MB budget. Combined with progressive freeze for quality, this is a viable path to higher-capacity models.
10. Implementation
- FrozenFC class: extends CastedLinear, overriding `_save_to_state_dict` and `_load_from_state_dict` to exclude frozen weights from serialization. Weights are regenerated from the seed at `__init__`.
- Progressive freeze: set `PROGRESSIVE_FREEZE_FRAC=0.3` to freeze MLP fc weights after 30% of training steps.
- torch.compile compatibility: FrozenFC requires `fullgraph=False` due to its different computation graph from CastedLinear. This incurs a ~15% throughput penalty — an important consideration for the wallclock-limited competition.
- See companion code: `selective_freeze_patch.py`, `record_train_gpt.py`

Reproduction
Related Work
Attribution
Builds on Clark's train_gpt.py (PR #1218), competition baseline architecture, and FineWeb sp1024/sp4096 datasets. All experiments self-funded (~$45 compute across H100, A40, M4).