
[Submission] Random LinearMaps + LoRA Adapters#1295

Open
austinluk wants to merge 3 commits into openai:main from austinluk:submission/random-linear-maps-lora

Conversation

@austinluk

No description provided.

@himanshudongre

Great to see someone else exploring this direction! I've been working on the same wishlist item and just submitted my findings in PR #1301.

TL;DR: Your "Potential Improvements" section nails it — selective freezing is the key.

I tested both full freeze + adapters (your approach) and selective freeze (freeze only MLP gate+up, learn attention fully) on FineWeb data. The results are dramatic:

| Approach | Frozen % | Best CE (FineWeb) | vs Baseline |
| --- | --- | --- | --- |
| Full freeze + VeRA rank=8 | 94% | 2.3388 | +80% gap |
| Full freeze + VeRA rank=16 | 94% | 2.3288 | +79% gap |
| Full freeze + VeRA rank=32 | 94% | 2.3221 | +79% gap |
| Selective freeze (gate+up only) | 37% | 1.2792 | -1.5% (better than baseline) |

Increasing adapter rank from 8→32 barely helps — the bottleneck is frozen attention weights that can't learn relational patterns, not adapter capacity.
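To make the "full freeze + adapter" setup concrete, here is a minimal NumPy sketch of a VeRA-style adapter on one frozen layer. All names and dimensions (`d_model`, `r`, `vera_forward`) are illustrative, not taken from either PR; VeRA keeps the low-rank matrices frozen and random, and trains only two small scaling vectors, which is why raising the rank adds so little capacity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r = 16, 8  # hypothetical dimensions for illustration

W0 = rng.standard_normal((d_model, d_model))  # frozen base weight
A = rng.standard_normal((r, d_model))         # frozen random down-projection
B = rng.standard_normal((d_model, r))         # frozen random up-projection
d_vec = np.full(r, 0.1)                       # trainable per-rank scaling
b_vec = np.zeros(d_model)                     # trainable output scaling, zero-init

def vera_forward(x):
    # y = W0 x + diag(b) B diag(d) A x  -- only d_vec and b_vec are trained
    return W0 @ x + b_vec * (B @ (d_vec * (A @ x)))

x = rng.standard_normal(d_model)
# With b_vec zero-initialized, the adapter contributes nothing at init:
assert np.allclose(vera_forward(x), W0 @ x)
```

Only `d_vec` and `b_vec` (here 8 + 16 scalars) are trainable, which keeps the artifact tiny but leaves the frozen attention weights as the bottleneck the comment describes.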

The fix: freeze only the MLP gate and up projections (feature expansion — where Johnson-Lindenstrauss applies naturally), learn everything else. This preserves the model's ability to learn attention patterns while getting artifact savings from frozen random projections.
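A sketch of what that selective-freeze policy could look like, assuming a LLaMA-style parameter naming scheme (`mlp.gate_proj`, `mlp.up_proj`, etc. are assumptions about the backbone, not names from this PR):

```python
# Hypothetical parameter-name suffixes for the MLP feature-expansion projections.
FROZEN_SUFFIXES = ("mlp.gate_proj.weight", "mlp.up_proj.weight")

def should_freeze(name: str) -> bool:
    """Freeze only the MLP gate/up projections; everything else stays trainable."""
    return name.endswith(FROZEN_SUFFIXES)

# Example parameter names for one transformer block:
params = [
    "layers.0.attn.q_proj.weight",
    "layers.0.attn.k_proj.weight",
    "layers.0.mlp.gate_proj.weight",
    "layers.0.mlp.up_proj.weight",
    "layers.0.mlp.down_proj.weight",
]
frozen = [p for p in params if should_freeze(p)]

# In a PyTorch model this would translate to:
#   for name, p in model.named_parameters():
#       p.requires_grad = not should_freeze(name)
```

The attention projections and the MLP down-projection remain trainable, so the model can still learn relational patterns while the random gate/up projections act as a fixed feature expansion.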

On the artifact-normalized comparison (the real competition question), a larger selectively frozen model beats a smaller fully-trained model once you account for artifact size:

| Config | CE (FineWeb) | Artifact |
| --- | --- | --- |
| 6L 192d fully-trained + dropout | 3.2531 | 2.4 MB |
| 12L 384d selective freeze + dropout | 2.8803 | 7.3 MB |

The frozen model has 4× more effective params at 3× the artifact cost — and it wins by 11.5%.

Full details + code in PR #1301. Would be interesting to see if your 12L 768d backbone with selective freeze (learn attention, freeze only MLP gate+up) closes the gap further.

@himanshudongre

Related work: I've been running extensive experiments on selective freeze (freezing gate+up projections only, 37% frozen) as an alternative to your full freeze + LoRA approach.

Key finding: selective freeze (37% frozen) dramatically outperforms full freeze + LoRA (94% frozen). The LoRA approach shows an ~80% quality gap versus baseline, while selective freeze achieves 2.1% lower CE than baseline on H100.

I also developed "progressive freeze" — train all weights fully for N steps, then freeze mid-training. This outperforms random-init freeze by 1.3 percentage points on FineWeb sp4096.
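Progressive freeze can be sketched as a schedule over trainable parameters; the step threshold and suffix names below are illustrative placeholders, not values from PR #1301:

```python
FREEZE_AT = 100  # hypothetical step at which mid-training freezing kicks in
FROZEN_SUFFIXES = ("mlp.gate_proj.weight", "mlp.up_proj.weight")  # assumed naming

def trainable_names(step: int, all_names: list[str]) -> set[str]:
    """Phase 1: train everything. Phase 2: freeze the gate/up projections."""
    if step < FREEZE_AT:
        return set(all_names)
    return {n for n in all_names if not n.endswith(FROZEN_SUFFIXES)}

names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.mlp.gate_proj.weight",
    "layers.0.mlp.up_proj.weight",
]
assert trainable_names(0, names) == set(names)                      # all trainable
assert trainable_names(FREEZE_AT, names) == {names[0]}              # gate/up frozen
```

The difference from random-init freeze is that the frozen projections have already been trained for `FREEZE_AT` steps, so they carry learned features rather than pure random ones.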

Full results with 7 architecture variants across H100 and A40: PR #1301.
