
13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence#1426

Closed
aravhawk wants to merge 7 commits into openai:main from aravhawk:14L-int4-packed-gptq

Conversation


@aravhawk aravhawk commented Apr 6, 2026

Summary

13 physical transformer layers (the first submission beyond 11), enabled by true int4 bit packing for MLP weights: a compression technique that stores two 4-bit quantized values in a single byte. This halves raw MLP storage before LZMA/Brotli runs, funding 2 extra layers within the 16 MB budget. Full Hessian GPTQ with Cholesky error compensation keeps int4 quality close to int6.

Builds on the full SOTA stack from @abaybektursun (PR #549 derivative) with all proven techniques from the latest top submissions.

Novel Contributions

True int4 bit packing (pack_int4/unpack_int4, 17 lines): Nobody in the competition stores quantized weights at native bitwidth. All other submissions store int6 values in full int8 bytes and rely on LZMA/Brotli to compress the unused range. Our packing eliminates this waste entirely for MLP weights, achieving 0.5 bytes per value before compression even runs.
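The PR names `pack_int4`/`unpack_int4` but does not show them. A minimal NumPy sketch of nibble packing follows; the +8 offset and high/low nibble layout are assumptions, chosen so the symmetric clip range [-7, 7] fits in one unsigned nibble:

```python
import numpy as np

def pack_int4(q):
    """Pack int4 values in [-7, 7], two per byte, into uint8.

    `q` is a flat integer array with an even number of elements.
    Values are offset by +8 into [1, 15] so each fits a nibble.
    """
    u = (q.astype(np.int16) + 8).astype(np.uint8)
    return (u[0::2] << 4) | u[1::2]  # high nibble | low nibble

def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed int4 values exactly."""
    hi = (packed >> 4).astype(np.int16) - 8
    lo = (packed & 0x0F).astype(np.int16) - 8
    out = np.empty(packed.size * 2, dtype=np.int16)
    out[0::2], out[1::2] = hi, lo
    return out
```

This yields exactly 0.5 bytes per value before LZMA/Brotli sees the data, rather than storing each value in a full int8 byte.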

13 physical layers with depth recurrence = 15+ virtual layers (vs 11 physical everywhere else). The extra unique layers provide genuinely new representational capacity, not just repeated passes through the same weights.
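Depth recurrence itself adds no parameters: selected blocks are simply run more than once per forward pass. A framework-agnostic sketch (the PR's exact loop placement may differ; `recur_layers=(4, 5)` and one extra loop match the listed hyperparameters):

```python
def forward_with_recurrence(blocks, x, recur_layers=(4, 5), extra_loops=1):
    """Run `blocks` in order, re-running the recurrent layers extra times.

    With 13 physical blocks and layers 4 and 5 each re-run once, the
    input passes through 15 "virtual" layers with zero extra parameters.
    """
    for i, block in enumerate(blocks):
        x = block(x)
        if i in recur_layers:
            for _ in range(extra_loops):
                x = block(x)
    return x
```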

Architecture

  • 13 layers, 512 dim, 8 heads, 4 KV heads (GQA)
  • MLP 3x expansion (hidden=1536), LeakyReLU(0.5)^2
  • U-Net: encoder 6, decoder 7, 6 skip connections
  • XSA on all 13 layers
  • BigramHash(4096, dim=112) + Trigram (zero extra params)
  • Value Embedding (dim=128) at layers 10, 11, 12
  • SmearGate, Partial RoPE (16/64), LN Scale (1/sqrt(layer+1))
  • Tied embeddings, logit softcap 30.0, QK Gain 5.0
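Logit softcapping is typically the one-liner `cap * tanh(logits / cap)`; the PR does not show its code, so the exact form here is an assumption based on the standard formulation:

```python
import math

def softcap(logit, cap=30.0):
    """Smoothly bound a logit to (-cap, cap) via cap * tanh(logit / cap).

    Near zero this is approximately the identity; large logits saturate
    toward +/- cap instead of growing unboundedly.
    """
    return cap * math.tanh(logit / cap)
```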

Technique Stack

| Technique | Source | Impact |
| --- | --- | --- |
| 13 layers + int4-packed MLP GPTQ | Novel | More unique capacity in 16MB |
| True int4 bit packing (2 vals/byte) | Novel | Halves raw MLP storage |
| Pre-Quant TTT (6-epoch AdamW, freeze 2 blocks, cosine LR) | PR #1364 | ~0.034 bpb |
| Depth recurrence, layers 4,5 (15 virtual) | PR #1204, #1420 | ~0.005 bpb |
| MuonEq-R (row-normalized Muon) | PR #1217, #1260 | ~0.001 bpb |
| QK Gain 5.0 | PR #1217, #1423 | ~0.005 bpb |
| Muon WD=0.085, Adam WD=0.02 | PR #1394, #1218 | Better compression |
| Trigram hash (zero extra params) | Existing code | ~0.002 bpb |
| BigramHash 4096x112 | Scaled from 3072 | ~0.001 bpb |
| 3 VE layers (10,11,12) | Extended from 2 | ~0.001 bpb |
| Full Hessian GPTQ (AR self-gen calib) | SOTA | ~0.007 bpb |
| XSA all layers | PR #478 | ~0.005 bpb |
| LeakyReLU(0.5)^2 | PR #493 | ~0.003 bpb |
| Parallel Muon + parameter banks | PR #399 | Systems opt |
| EMA(0.997) + Late QAT | SOTA | ~0.002 bpb |
| Warmdown 66.7%, LR 0.02 | PR #1394 | Training opt |

Quantization Details

MLP weights: int4 (clip_range=7) with Full Hessian GPTQ, then nibble-packed into uint8. Attention weights: int6 (clip_range=31) with Full Hessian GPTQ, stored as int8. Embedding: int8 passthrough. Scales: per-row fp16. Compression: LZMA preset 9 plus selective pruning.
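For orientation, symmetric per-row quantization with per-row fp16 scales looks roughly like this (a plain round-to-nearest sketch; the actual pipeline adds GPTQ's Hessian-based error compensation, which is omitted here):

```python
import numpy as np

def quantize_rows(w, clip):
    """Symmetric per-row quantization to integer levels in [-clip, clip].

    clip=7 corresponds to the int4 MLP path, clip=31 to int6 attention.
    One fp16 scale is stored per row.
    """
    scale = (np.abs(w).max(axis=1, keepdims=True) / clip).astype(np.float16)
    q = np.clip(np.round(w / scale.astype(np.float32)), -clip, clip).astype(np.int8)
    return q, scale

def dequantize_rows(q, scale):
    """Reconstruct approximate fp32 weights from codes and scales."""
    return q.astype(np.float32) * scale.astype(np.float32)
```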

Estimated artifact size: ~14.97 MB (validated via smoke test on random weights, with 0.82x GPTQ correction factor applied).

Hyperparameters

```
num_layers=13, model_dim=512, num_heads=8, num_kv_heads=4
mlp_mult=3.0, vocab_size=1024, train_seq_len=2048
matrix_lr=0.02, scalar_lr=0.02, tied_embed_lr=0.03
muon_wd=0.085, adam_wd=0.02, muon_momentum=0.99
qk_gain_init=5.0, warmdown_iters=3500, grad_clip=0.3
bigram_vocab_size=4096, bigram_dim=112, trigram=1
xsa_last_n=13, ve_layers=10,11,12
recur_layers=4,5, recur_extra_loops=1
ttt_epochs=6, ttt_lr=0.0005, ttt_freeze_blocks=2
```

All configurable via environment variables for ablation.

SP8192 Migration Path

Current defaults use SP1024. For SP8192 mode (all top 5 use this):

```
DATA_PATH=./data/datasets/fineweb10B_sp8192
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model
VOCAB_SIZE=8192
```

Status

val_bpb: TBD (pending 3-seed evaluation on 8xH100)

Smoke test validated on CPU: the int4 packing round-trips exactly, and the artifact size estimate is 14.97 MB (fits the 16,000,000-byte cap).

Test plan

  • Verify artifact fits in 16,000,000 bytes on 8xH100
  • Run 3 seeds (42, 314, 999)
  • Verify pack/unpack roundtrip preserves weights exactly
  • Compare val_bpb against current SOTA
  • Test with NUM_LAYERS=12 and NUM_LAYERS=14 as fallback/stretch

Built on SOTA by @abaybektursun, with techniques from @clarkkev, @stukenov, @msisovic, @gowtham0992, @parinzee, @bigbag, @dexhunter.

aravhawk added 7 commits April 6, 2026 16:49
14 layers (first beyond 11) funded by true int4 bit-packing for MLP
weights. Novel pack_int4/unpack_int4 stores 2 quantized values per byte,
halving raw MLP storage before LZMA. Full Hessian GPTQ with AR
self-generated calibration. Built on abaybektursun SOTA.

Adjusted from 14L to 13L default after empirical LZMA compression
testing showed 14L is tight (~0.6MB headroom). 13L has ~2MB headroom.
Int4 bit-packing and all GPTQ innovations preserved. NUM_LAYERS=14
available as stretch goal.

Stack proven optimizations from top submissions:
- QK-Gain 5.0 (from 1.5, monotonic gains proven in 45 experiments)
- Depth recurrence layers 4,5 (15 virtual layers from 13 physical)
- Trigram hash enabled (zero extra params, reuses bigram table)
- BigramHash 4096 (from 3072, using int4 size headroom)
- VE layers 10,11,12 (3 layers from 2)

Biggest remaining technique from top submissions: fine-tune EMA weights
on training data before GPTQ quantization. 6 epochs, AdamW lr=0.0005
with cosine decay, freeze first 2 blocks. Expected -0.034 bpb gain.
Configurable via TTT_EPOCHS, TTT_LR, TTT_FREEZE_BLOCKS env vars.
Set TTT_EPOCHS=0 to disable.

Bugs fixed:
- Trigram default mismatch: GPT.__init__ used "0" while Hyperparameters
  used "1". Now consistently "1" everywhere.
- TTT bank freezing was a no-op (.data[i].requires_grad doesn't work on
  slices). Replaced with gradient masking after backward pass.
- TTT steps calculation used val_tokens instead of train_tokens count.
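The bank-freezing fix is worth illustrating: setting `requires_grad` on a slice of `.data` is a silent no-op because slicing creates a new tensor, so the fix masks gradients after `backward()` instead. A minimal sketch (the `bank`/`frozen_rows` names are illustrative, not the PR's actual identifiers):

```python
import torch

def mask_frozen_bank_grads(bank, frozen_rows):
    """Freeze part of a parameter bank by zeroing its gradient rows.

    Called after loss.backward() and before optimizer.step(); the frozen
    rows then receive a zero update, which actually freezes them, unlike
    bank.data[frozen_rows].requires_grad = False (a no-op on a slice).
    """
    if bank.grad is not None:
        bank.grad[frozen_rows] = 0.0
```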

Optimizations from research team:
- MuonEq-R: add row normalization before NS5 (~0.001 bpb free gain)
- muon_wd 0.04 -> 0.085 (matches all top 5, better compression)
- adam_wd 0.04 -> 0.02 (per clarkkev's finding)
- matrix_lr 0.025 -> 0.02 (matches SP8192 base, better for larger models)
- scalar_lr 0.025 -> 0.02
- tied_embed_lr 0.035 -> 0.03
- warmdown_iters 2800 -> 3500 (~66.7% of training, matches all top 5)

These match the proven hyperparameters from clarkkev's SP8192 base
(PR openai#1394) and every top 5 submission.
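The MuonEq-R change above is small enough to sketch. Assuming it composes row normalization with the standard quintic Newton-Schulz iteration from the public Muon optimizer (coefficients below are Muon's usual constants; wide matrices with rows ≤ cols; the PR's exact variant may differ):

```python
import torch

def muon_eq_r_orthogonalize(g, steps=5, eps=1e-7):
    """MuonEq-R sketch: L2-normalize each row of the update, then run
    NS5 to approximately orthogonalize it (singular values pushed
    toward 1) before the weight update is applied."""
    a, b, c = 3.4445, -4.7750, 2.0315
    g = g / (g.norm(dim=1, keepdim=True) + eps)  # MuonEq-R row normalization
    x = g / (g.norm() + eps)                     # scale so singular values <= 1
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x
```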
@aravhawk aravhawk changed the title 14L Int4-Packed MLP GPTQ + XSA-all + BigramHash 3072x112 13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence Apr 7, 2026
@aravhawk aravhawk closed this Apr 7, 2026
@aravhawk aravhawk deleted the 14L-int4-packed-gptq branch April 7, 2026 01:01