Add record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli (1.1020 BPB)#1392
Open
Its-Just-Crump wants to merge 1 commit into openai:main
Record: SP4096 + Depth Recurrence + Parallel Residuals + QK-Gain + Brotli
val_bpb: 1.1020 (3-seed mean, std 0.0011) | ~15.88 MB max | 8xH100 SXM, 600s | No TTT
Improvement over current merged SOTA (PR #1019, 1.1147 BPB): -0.0127 BPB / -0.0088 nats (Welch t=-18.37, df=2.38, p<0.001)
Results
Spread across seeds: 0.0023 BPB (very tight). All 3 seeds fit under 16MB with >=124KB margin.
Tokenizer Change: BPB Correctness Proof
This submission uses a SentencePiece 4096 BPE tokenizer (`fineweb_4096_bpe.model`) instead of the baseline SP1024. Per competition rules, we provide a detailed proof that val_bpb is correctly calculated.

How BPB is computed in this script:
The `val_bpb` metric is computed by the same `sliding_window_bpb()` function used by all submissions in this repo. The function:

- accumulates the model's cross-entropy over the validation tokens in nats
- looks up `token_byte_lengths[token_id]` for each token and sums the byte counts
- computes BPB = total_nats / (total_bytes * ln(2))

The `token_byte_lengths` lookup table is built by `build_sentencepiece_luts()`, which inspects each token's UTF-8 byte length via `sp.id_to_piece(token_id)`. This is independent of vocabulary size: a token that represents "the" is 3 bytes whether the vocab is 1024 or 4096.

Key invariant: the total byte count of the validation set is identical regardless of tokenizer, because every tokenizer produces a lossless segmentation of the same byte sequence. Whether that sequence is split into more tokens (SP1024) or fewer tokens (SP4096), the byte sum is the same. Therefore BPB is a fair cross-tokenizer comparison.
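A minimal standalone sketch of that computation (the function and variable names here mirror the description above, not the actual script):

```python
import math

def bits_per_byte(nats_per_token, token_ids, token_byte_lengths):
    """BPB = total cross-entropy in nats / (total UTF-8 bytes * ln 2).

    token_byte_lengths maps token id -> UTF-8 byte length of that token's
    piece, which is what build_sentencepiece_luts() precomputes from the
    tokenizer model.
    """
    total_nats = sum(nats_per_token)
    total_bytes = sum(token_byte_lengths[t] for t in token_ids)
    return total_nats / (total_bytes * math.log(2))
```

Swapping tokenizers changes `token_ids` and the per-token losses, but `total_bytes` stays fixed for a given validation set, which is exactly the invariant above.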
Verification from logs: the validation set has `tokens:45508608` SP4096 tokens. At ~3.32 bytes/token on average, this covers the same ~151M-byte validation set used by SP1024 submissions (which log ~131M tokens at ~1.15 bytes/token). The per-token cross-entropy is higher with SP4096 (2.54 nats vs 1.88 nats) because each token covers more bytes, but the per-byte rate (BPB) is directly comparable.

What Changed vs PR #1019
This submission replaces the SP1024 + BigramHash + LZMA stack with an SP4096-native architecture that draws its extra capacity from the larger vocabulary and from recurrent/parallel techniques instead of explicit bigram features.
1. SP4096 Tokenizer + MLP 4x (from SP1024 + MLP 3x)
Switching to a 4096-token SentencePiece vocabulary with 4x MLP multiplier increases model capacity from ~27M to 34.4M parameters. The larger vocabulary captures more subword patterns natively, eliminating the need for BigramHash (which compresses 3.4x worse per parameter with SP4096).
2. Depth Recurrence (Layers 4-5 from Step 3000)
After step 3000, layers 4 and 5 are re-executed, effectively giving the model 13 logical layers for the cost of 11 layers' parameters. This adds zero parameters — it's purely a compute-time technique that trades ~10% wall-clock time for improved representation depth. Source: PR #1260 ablation, estimated -0.0035 BPB.
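The re-execution described above can be sketched in a few lines. This is a toy stand-in: `blocks` is any sequence of callables (transformer blocks in the real script), and `recur_layers`/`recur_start_step` mirror the run command's `RECUR_LAYERS`/`RECUR_START_STEP` knobs, not the submission's actual code:

```python
def forward_with_recurrence(blocks, x, step, recur_layers=(4, 5), recur_start_step=3000):
    """Run every block once; after the warm-up step threshold, re-run the
    recurrent layers a second time. Adds compute, zero parameters."""
    for block in blocks:
        x = block(x)
    if step >= recur_start_step:
        for i in recur_layers:
            x = blocks[i](x)  # second pass: 11 physical layers, 13 logical
    return x
```

Before `recur_start_step` the forward pass is the plain 11-layer stack; afterwards layers 4 and 5 execute twice per step, which is where the ~10% wall-clock cost comes from.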
3. Parallel Residuals (Layer 7+)
From layer 7 onward, the MLP and attention outputs are merged through a learned `lane_merge` scalar and a `resid_mix_mlp` vector per layer (~20KB raw, ~3-5KB compressed). This lets the model balance attention vs MLP contributions dynamically. Source: PR #1289, estimated -0.0035 BPB.

4. QK-Gain 5.0
Initializes the query and key projections with 5x scale, sharpening attention from the start of training at no parameter cost. Source: PR #1217 (45 experiments), estimated -0.001 BPB.
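A sketch of what a 5x q/k init gain amounts to, assuming the gain is a plain multiplier on a standard small-std Gaussian init (the real script's initialization may differ in detail):

```python
import random

def scaled_qk_init(rows, cols, qk_gain=5.0, base_std=0.02):
    """Initialize a query/key projection matrix with the usual small-std
    Gaussian init, scaled by qk_gain. Larger q/k magnitudes produce larger
    pre-softmax logits, i.e. sharper attention, from step 0, and the gain
    itself is folded into the weights so it costs no extra parameters."""
    return [[random.gauss(0.0, base_std) * qk_gain for _ in range(cols)]
            for _ in range(rows)]
```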
5. MuonEq-R Optimizer
Row-norm normalization before Newton-Schulz iteration in Muon. ~15 lines of code, zero parameter cost, minor but consistent improvement. Source: PR #1334.
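A sketch of the idea, assuming the row normalization is applied to the update matrix just before Muon's usual quintic Newton-Schulz orthogonalization. The coefficients below are the widely used Muon constants; the exact placement of the row-norm step in PR #1334 may differ:

```python
import numpy as np

def muon_eq_r_orthogonalize(g, steps=5, eps=1e-7):
    """Row-equalize the update matrix, then approximately orthogonalize it
    with Muon's quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315       # standard Muon NS coefficients
    x = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)  # MuonEq-R row norm
    x = x / (np.linalg.norm(x) + eps)        # usual pre-NS scaling
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                              # keep the x @ x.T product small
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transposed else x
```

The row-norm step costs a handful of lines and no parameters; it equalizes per-row scales so the iteration treats all rows comparably.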
6. ADAM_WD=0.090 + GPTQ Tuning
Increased Adam weight decay from 0.02 to 0.090 (matching Muon WD). GPTQ calibration increased from 64 to 128 AR self-generated sequences for denser Hessian estimates with the larger SP4096 model. Dampening factor tuned to 0.01.
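For context, the dampening factor in GPTQ-style quantizers is conventionally applied to the diagonal of the calibration Hessian; a minimal sketch of that convention (an assumed form, not the submission's actual calibration code):

```python
import numpy as np

def dampen_hessian(h, damp=0.01):
    """Add damp * mean(diag(H)) to H's diagonal so the factorization GPTQ
    performs stays well conditioned even with limited calibration data."""
    mean_diag = np.mean(np.diag(h))
    return h + damp * mean_diag * np.eye(h.shape[0])
```

More calibration sequences (64 -> 128) make the Hessian estimate denser, which is why a relatively small dampening value like 0.01 suffices here.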
7. Brotli Compression (from LZMA)
SP4096 int6 weights compress better under Brotli than LZMA. This switch recovers the size headroom that BigramHash removal freed up.
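This kind of claim is cheap to sanity-check on any serialized weight buffer; a minimal comparison harness (the quality/preset settings here are illustrative, not the submission's):

```python
import lzma
import brotli

def compressed_sizes(raw: bytes) -> dict:
    """Compare LZMA and Brotli on the same byte buffer at high settings."""
    return {
        "raw": len(raw),
        "lzma": len(lzma.compress(raw, preset=9)),
        "brotli": len(brotli.compress(raw, quality=11)),
    }
```

Which codec wins depends on the byte statistics of the quantized weights, so a check like this is worth rerunning whenever the quantization format changes.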
Dropped vs PR #1019
Architecture
Requirements
Flash Attention 3 (Hopper) is required.
```sh
pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install sentencepiece zstandard brotli
python3 -c "from flash_attn_interface import flash_attn_func; import sentencepiece, zstandard, brotli; print('deps OK')"
```

Run Command
```sh
VOCAB_SIZE=4096 MLP_MULT=4.0 QK_GAIN_INIT=5.0 MUON_EQ_R=1 \
RECUR_LAYERS="4,5" RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
MUON_WD=0.090 ADAM_WD=0.090 WARMDOWN_ITERS=4000 \
GPTQ_CALIB_BATCHES=128 GPTQ_DAMP=0.01 \
BIGRAM_VOCAB_SIZE=0 TRIGRAM=0 TARGET_MB=15.9 SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Lineage