
13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence#1426

Closed
aravhawk wants to merge 7 commits into openai:main from aravhawk:14L-int4-packed-gptq

Conversation


@aravhawk aravhawk commented Apr 6, 2026

Summary

13 physical transformer layers (the first submission beyond 11), enabled by true int4 bit packing for MLP weights: a compression technique that stores two 4-bit quantized values in a single byte. This halves raw MLP storage before LZMA/Brotli runs, funding 2 extra layers within the 16 MB budget. Full Hessian GPTQ with Cholesky error compensation keeps int4 quality close to int6.

Builds on the full SOTA stack from @abaybektursun (PR #549 derivative) with all proven techniques from the latest top submissions.

Novel Contributions

True int4 bit packing (pack_int4/unpack_int4, 17 lines): Nobody in the competition stores quantized weights at native bitwidth. All other submissions store int6 values in full int8 bytes and rely on LZMA/Brotli to compress the unused range. Our packing eliminates this waste entirely for MLP weights, achieving 0.5 bytes per value before compression even runs.
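The PR names `pack_int4`/`unpack_int4` but does not show them. A minimal NumPy sketch of nibble packing follows; the +8 offset and high/low nibble layout are assumptions, chosen so the symmetric clip range [-7, 7] fits in one unsigned nibble:

```python
import numpy as np

def pack_int4(q):
    """Pack int4 values in [-7, 7], two per byte, into uint8.

    `q` is a flat integer array with an even number of elements.
    Values are offset by +8 into [1, 15] so each fits a nibble.
    """
    u = (q.astype(np.int16) + 8).astype(np.uint8)
    return (u[0::2] << 4) | u[1::2]  # high nibble | low nibble

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed int4 values exactly."""
    hi = (packed >> 4).astype(np.int16) - 8
    lo = (packed & 0x0F).astype(np.int16) - 8
    out = np.empty(packed.size * 2, dtype=np.int16)
    out[0::2], out[1::2] = hi, lo
    return out
```

This yields exactly 0.5 bytes per value before LZMA/Brotli sees the data, rather than storing each value in a full int8 byte.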

13 physical layers with depth recurrence = 15+ virtual layers (vs 11 physical everywhere else). The extra unique layers provide genuinely new representational capacity, not just repeated passes through the same weights.
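Depth recurrence itself adds no parameters: selected blocks are simply run more than once per forward pass. A framework-agnostic sketch (the PR's exact loop placement may differ; `recur_layers=(4, 5)` and one extra loop match the listed hyperparameters):

```python
def forward_with_recurrence(blocks, x, recur_layers=(4, 5), extra_loops=1):
    """Run `blocks` in order, re-running the recurrent layers extra times.

    With 13 physical blocks and layers 4 and 5 each re-run once, the
    input passes through 15 "virtual" layers with zero extra parameters.
    """
    for i, block in enumerate(blocks):
        x = block(x)
        if i in recur_layers:
            for _ in range(extra_loops):
                x = block(x)
    return x
```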

Architecture

  • 13 layers, 512 dim, 8 heads, 4 KV heads (GQA)
  • MLP 3x expansion (hidden=1536), LeakyReLU(0.5)^2
  • U-Net: encoder 6, decoder 7, 6 skip connections
  • XSA on all 13 layers
  • BigramHash(4096, dim=112) + Trigram (zero extra params)
  • Value Embedding (dim=128) at layers 10, 11, 12
  • SmearGate, Partial RoPE (16/64), LN Scale (1/sqrt(layer+1))
  • Tied embeddings, logit softcap 30.0, QK Gain 5.0
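Logit softcapping is typically the one-liner `cap * tanh(logits / cap)`; the PR does not show its code, so the exact form here is an assumption based on the standard formulation:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap) via cap * tanh(logit / cap).

    Near zero this is approximately the identity; large logits saturate
    toward +/- cap instead of growing unboundedly.
    """
    return cap * math.tanh(logit / cap)
```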

Technique Stack

| Technique | Source | Impact |
| --- | --- | --- |
| 13 layers + int4-packed MLP GPTQ | Novel | More unique capacity in 16MB |
| True int4 bit packing (2 vals/byte) | Novel | Halves raw MLP storage |
| Pre-Quant TTT (6-epoch AdamW, freeze 2 blocks, cosine LR) | PR #1364 | ~0.034 bpb |
| Depth recurrence, layers 4,5 (15 virtual) | PR #1204, #1420 | ~0.005 bpb |
| MuonEq-R (row-normalized Muon) | PR #1217, #1260 | ~0.001 bpb |
| QK Gain 5.0 | PR #1217, #1423 | ~0.005 bpb |
| Muon WD=0.085, Adam WD=0.02 | PR #1394, #1218 | Better compression |
| Trigram hash (zero extra params) | Existing code | ~0.002 bpb |
| BigramHash 4096x112 | Scaled from 3072 | ~0.001 bpb |
| 3 VE layers (10,11,12) | Extended from 2 | ~0.001 bpb |
| Full Hessian GPTQ (AR self-gen calib) | SOTA | ~0.007 bpb |
| XSA all layers | PR #478 | ~0.005 bpb |
| LeakyReLU(0.5)^2 | PR #493 | ~0.003 bpb |
| Parallel Muon + parameter banks | PR #399 | Systems opt |
| EMA(0.997) + Late QAT | SOTA | ~0.002 bpb |
| Warmdown 66.7%, LR 0.02 | PR #1394 | Training opt |

Quantization Details

MLP weights: int4 (clip_range=7) with Full Hessian GPTQ, then nibble-packed into uint8. Attention weights: int6 (clip_range=31) with Full Hessian GPTQ, stored as int8. Embedding: int8 passthrough. Scales: per-row fp16. Compression: LZMA preset 9 plus selective pruning.
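For orientation, symmetric per-row quantization with per-row fp16 scales looks roughly like this (a plain round-to-nearest sketch; the actual pipeline adds GPTQ's Hessian-based error compensation, which is omitted here):

```python
import numpy as np

def quantize_rows(w, clip):
    """Symmetric per-row quantization to integer levels in [-clip, clip].

    clip=7 corresponds to the int4 MLP path, clip=31 to int6 attention.
    One fp16 scale is stored per row.
    """
    scale = (np.abs(w).max(axis=1, keepdims=True) / clip).astype(np.float16)
    q = np.clip(np.round(w / scale.astype(np.float32)), -clip, clip).astype(np.int8)
    return q, scale

def dequantize_rows(q, scale):
    """Reconstruct approximate fp32 weights from codes and scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```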

Estimated artifact size: ~14.97 MB (validated via smoke test on random weights, with 0.82x GPTQ correction factor applied).

Hyperparameters

```
num_layers=13, model_dim=512, num_heads=8, num_kv_heads=4
mlp_mult=3.0, vocab_size=1024, train_seq_len=2048
matrix_lr=0.02, scalar_lr=0.02, tied_embed_lr=0.03
muon_wd=0.085, adam_wd=0.02, muon_momentum=0.99
qk_gain_init=5.0, warmdown_iters=3500, grad_clip=0.3
bigram_vocab_size=4096, bigram_dim=112, trigram=1
xsa_last_n=13, ve_layers=10,11,12
recur_layers=4,5, recur_extra_loops=1
ttt_epochs=6, ttt_lr=0.0005, ttt_freeze_blocks=2
```

All configurable via environment variables for ablation.

SP8192 Migration Path

Current defaults use SP1024. For SP8192 mode (all top 5 use this):

```
DATA_PATH=./data/datasets/fineweb10B_sp8192
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model
VOCAB_SIZE=8192
```

Status

val_bpb: TBD (pending 3-seed evaluation on 8xH100)

Smoke test validated on CPU: the int4 packing round-trips exactly, and the artifact size estimate is 14.97 MB (fits the 16,000,000-byte cap).

Test plan

  • Verify artifact fits in 16,000,000 bytes on 8xH100
  • Run 3 seeds (42, 314, 999)
  • Verify pack/unpack roundtrip preserves weights exactly
  • Compare val_bpb against current SOTA
  • Test with NUM_LAYERS=12 and NUM_LAYERS=14 as fallback/stretch

Built on SOTA by @abaybektursun, with techniques from @clarkkev, @stukenov, @msisovic, @gowtham0992, @parinzee, @bigbag, @dexhunter.

aravhawk added 7 commits April 6, 2026 16:49
14 layers (first beyond 11) funded by true int4 bit-packing for MLP
weights. Novel pack_int4/unpack_int4 stores 2 quantized values per byte,
halving raw MLP storage before LZMA. Full Hessian GPTQ with AR
self-generated calibration. Built on abaybektursun SOTA.

Adjusted from 14L to 13L default after empirical LZMA compression
testing showed 14L is tight (~0.6MB headroom). 13L has ~2MB headroom.
Int4 bit-packing and all GPTQ innovations preserved. NUM_LAYERS=14
available as stretch goal.

Stack proven optimizations from top submissions:
- QK-Gain 5.0 (from 1.5, monotonic gains proven in 45 experiments)
- Depth recurrence layers 4,5 (15 virtual layers from 13 physical)
- Trigram hash enabled (zero extra params, reuses bigram table)
- BigramHash 4096 (from 3072, using int4 size headroom)
- VE layers 10,11,12 (3 layers from 2)

Biggest remaining technique from top submissions: fine-tune EMA weights
on training data before GPTQ quantization. 6 epochs, AdamW lr=0.0005
with cosine decay, freeze first 2 blocks. Expected -0.034 bpb gain.
Configurable via TTT_EPOCHS, TTT_LR, TTT_FREEZE_BLOCKS env vars.
Set TTT_EPOCHS=0 to disable.

Bugs fixed:
- Trigram default mismatch: GPT.__init__ used "0" while Hyperparameters
  used "1". Now consistently "1" everywhere.
- TTT bank freezing was a no-op (.data[i].requires_grad doesn't work on
  slices). Replaced with gradient masking after backward pass.
- TTT steps calculation used val_tokens instead of train_tokens count.
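The bank-freezing fix is worth illustrating: setting `requires_grad` on a slice of `.data` is a silent no-op because slicing creates a new tensor, so the fix masks gradients after `backward()` instead. A minimal sketch (the `bank`/`frozen_rows` names are illustrative, not the PR's actual identifiers):

```python
import torch

def mask_frozen_bank_grads(bank, frozen_rows):
    """Freeze part of a parameter bank by zeroing its gradient rows.

    Called after loss.backward() and before optimizer.step(); the frozen
    rows then receive a zero update, which actually freezes them, unlike
    bank.data[frozen_rows].requires_grad = False (a no-op on a slice).
    """
    if bank.grad is not None:
        bank.grad[frozen_rows] = 0.0
```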

Optimizations from research team:
- MuonEq-R: add row normalization before NS5 (~0.001 bpb free gain)
- muon_wd 0.04 -> 0.085 (matches all top 5, better compression)
- adam_wd 0.04 -> 0.02 (per clarkkev's finding)
- matrix_lr 0.025 -> 0.02 (matches SP8192 base, better for larger models)
- scalar_lr 0.025 -> 0.02
- tied_embed_lr 0.035 -> 0.03
- warmdown_iters 2800 -> 3500 (~66.7% of training, matches all top 5)

These match the proven hyperparameters from clarkkev's SP8192 base
(PR openai#1394) and every top 5 submission.
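The MuonEq-R change above is small enough to sketch. Assuming it composes row normalization with the standard quintic Newton-Schulz iteration from the public Muon optimizer (coefficients below are Muon's usual constants; wide matrices with rows ≤ cols; the PR's exact variant may differ):

```python
import torch

def muon_eq_r_orthogonalize(g, steps=5, eps=1e-7):
    """MuonEq-R sketch: L2-normalize each row of the update, then run
    NS5 to approximately orthogonalize it (singular values pushed
    toward 1) before the weight update is applied."""
    a, b, c = 3.4445, -4.7750, 2.0315
    g = g / (g.norm(dim=1, keepdim=True) + eps)  # MuonEq-R row normalization
    x = g / (g.norm() + eps)                     # scale so singular values <= 1
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x
```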
@aravhawk aravhawk changed the title 14L Int4-Packed MLP GPTQ + XSA-all + BigramHash 3072x112 13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence Apr 7, 2026
@aravhawk aravhawk closed this Apr 7, 2026
@aravhawk aravhawk deleted the 14L-int4-packed-gptq branch April 7, 2026 01:01