13L Int4-Packed MLP GPTQ + MuonEq-R + Pre-Quant TTT + Depth Recurrence #1426
Closed
aravhawk wants to merge 7 commits into openai:main from
Conversation
14 layers (the first submission beyond 11), funded by true int4 bit-packing for MLP weights. The novel pack_int4/unpack_int4 pair stores 2 quantized values per byte, halving raw MLP storage before LZMA. Full Hessian GPTQ with AR self-generated calibration. Built on @abaybektursun's SOTA stack.
Adjusted from 14L to 13L default after empirical LZMA compression testing showed 14L is tight (~0.6MB headroom). 13L has ~2MB headroom. Int4 bit-packing and all GPTQ innovations preserved. NUM_LAYERS=14 available as stretch goal.
Stacked proven optimizations from top submissions:
- QK-Gain 5.0 (from 1.5; monotonic gains proven in 45 experiments)
- Depth recurrence on layers 4, 5 (15 virtual layers from 13 physical)
- Trigram hash enabled (zero extra params, reuses the bigram table)
- BigramHash 4096 (from 3072, using the int4 size headroom)
- VE layers 10, 11, 12 (3 layers, up from 2)
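The depth-recurrence item above can be sketched as a forward pass that re-runs selected blocks through the same weights. This is an illustrative reconstruction, not the PR's actual implementation; the layer indices (4, 5) are from the list above, and `forward_with_recurrence` is a hypothetical name:

```python
def forward_with_recurrence(blocks, x, recurrent=(4, 5)):
    """Run 13 physical blocks; blocks 4 and 5 are applied a second time,
    yielding 15 'virtual' layers without any extra parameters."""
    for i, block in enumerate(blocks):
        x = block(x)
        if i in recurrent:
            # Second pass through the same weights (weight reuse, not new capacity)
            x = block(x)
    return x
```

With 13 blocks and 2 recurrent indices, the forward pass performs 15 block applications in total.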
Biggest remaining technique from top submissions: fine-tune EMA weights on training data before GPTQ quantization. 6 epochs, AdamW lr=0.0005 with cosine decay, freeze first 2 blocks. Expected -0.034 bpb gain. Configurable via TTT_EPOCHS, TTT_LR, TTT_FREEZE_BLOCKS env vars. Set TTT_EPOCHS=0 to disable.
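The schedule described above (env-configurable epochs/LR/frozen blocks, cosine decay to zero) can be sketched as follows. `ttt_config` and `cosine_lr` are hypothetical helper names; only the env var names, defaults, and the train-token-based step count are from the PR:

```python
import math
import os

def ttt_config(train_tokens, batch_tokens):
    """Read pre-quant fine-tune knobs from env vars (defaults per the PR)."""
    epochs = int(os.environ.get("TTT_EPOCHS", "6"))      # 0 disables TTT
    lr = float(os.environ.get("TTT_LR", "0.0005"))
    freeze = int(os.environ.get("TTT_FREEZE_BLOCKS", "2"))
    # Step count must derive from the *training* token count, not val_tokens
    steps = epochs * (train_tokens // batch_tokens)
    return epochs, lr, freeze, steps

def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr at step 0 to 0 at total_steps."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

In the actual run these values would feed an AdamW optimizer over the unfrozen blocks' parameters.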
Bugs fixed:
- Trigram default mismatch: GPT.__init__ used "0" while Hyperparameters used "1". Now consistently "1" everywhere.
- TTT bank freezing was a no-op (.data[i].requires_grad doesn't work on slices). Replaced with gradient masking after the backward pass.
- TTT steps calculation used the val_tokens count instead of train_tokens.

Optimizations from the research team:
- MuonEq-R: add row normalization before NS5 (~0.001 bpb free gain)
- muon_wd 0.04 -> 0.085 (matches all top 5, better compression)
- adam_wd 0.04 -> 0.02 (per clarkkev's finding)
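The bank-freezing fix is worth illustrating: setting `requires_grad` on a slice of a tensor's `.data` silently does nothing, so freezing has to happen by zeroing those rows of the gradient after backward, before the optimizer step. A minimal NumPy sketch of the masking idea (`mask_frozen_rows` is a hypothetical name; the real code operates on `param.grad` in PyTorch):

```python
import numpy as np

def mask_frozen_rows(grad, frozen):
    """Zero the gradient rows of frozen bank entries.

    Per-slice requires_grad is a no-op, so instead the full gradient is
    computed and the frozen rows are masked out before the update.
    """
    grad = grad.copy()  # don't mutate the caller's gradient in this sketch
    for i in frozen:
        grad[i] = 0.0
    return grad
```

In the training loop this runs between `loss.backward()` and `optimizer.step()`, so frozen rows receive exactly zero update while the rest of the bank trains normally.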
- matrix_lr 0.025 -> 0.02 (matches the SP8192 base, better for larger models)
- scalar_lr 0.025 -> 0.02
- tied_embed_lr 0.035 -> 0.03
- warmdown_iters 2800 -> 3500 (~66.7% of training, matches all top 5)

These match the proven hyperparameters from clarkkev's SP8192 base (PR openai#1394) and every top 5 submission.
Summary
13 physical transformer layers (the first submission beyond 11), enabled by true int4 bit-packing for MLP weights: a novel compression step that stores two 4-bit quantized values in a single byte. This halves raw MLP storage before LZMA/Brotli, funding 2 extra layers within the 16 MB budget. Full Hessian GPTQ with Cholesky error compensation keeps int4 quality close to int6.
Builds on the full SOTA stack from @abaybektursun (PR #549 derivative) with all proven techniques from the latest top submissions.
Novel Contributions
- True int4 bit-packing (pack_int4/unpack_int4, 17 lines): nobody else in the competition stores quantized weights at native bitwidth. All other submissions store int6 values in full int8 bytes and rely on LZMA/Brotli to compress away the unused range. Our packing eliminates this waste entirely for MLP weights, achieving 0.5 bytes per value before compression even runs.
- 13 physical layers with depth recurrence = 15+ virtual layers (vs 11 physical everywhere else). The extra unique layers provide genuinely new representational capacity, not just repeated passes through the same weights.
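Nibble packing of the kind described can be sketched in a few lines of NumPy. The function names match the PR's `pack_int4`/`unpack_int4`, but this body is an illustrative reconstruction, not the PR's code:

```python
import numpy as np

def pack_int4(q):
    """Pack int4 values (range [-8, 7]) into uint8, two per byte."""
    flat = q.astype(np.int8).ravel()
    if flat.size % 2:                      # pad to an even count
        flat = np.append(flat, np.int8(0))
    u = (flat + 8).astype(np.uint8)        # shift to unsigned nibbles 0..15
    return u[0::2] | (u[1::2] << 4)        # even index -> low nibble, odd -> high

def unpack_int4(packed, n):
    """Inverse of pack_int4; n is the original number of values."""
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out[:n]
```

The payoff is exactly the 0.5 bytes per value claimed above: a lossless roundtrip at half the storage of int8, before LZMA/Brotli ever sees the buffer.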
Architecture
Technique Stack
Quantization Details
- MLP weights: int4 (clip_range=7) with Full Hessian GPTQ, then packed into uint8 via nibble packing.
- Attention weights: int6 (clip_range=31) with Full Hessian GPTQ, stored as int8.
- Embedding: int8 passthrough.
- Scales: fp16 per row.
- Compression: LZMA preset 9 + selective pruning.
Estimated artifact size: ~14.97 MB (validated via smoke test on random weights, with 0.82x GPTQ correction factor applied).
Hyperparameters
All configurable via environment variables for ablation.
SP8192 Migration Path
Current defaults use SP1024. For SP8192 mode (all top 5 use this):
Status
val_bpb: TBD (pending 3-seed evaluation on 8xH100)
Smoke test validated on CPU: the int4 packing roundtrip is exact, and the artifact size estimate is 14.97 MB (within the 16,000,000-byte cap).
Test plan
Built on SOTA by @abaybektursun, with techniques from @clarkkev, @stukenov, @msisovic, @gowtham0992, @parinzee, @bigbag, @dexhunter.