11L + XSA4 + EMA(0.997) + seq2048 + Int5-MLP + MuonWD=0.04 + LateK-FP16 | val_bpb=1.1361 #372
Closed
HyperPotatoNeo wants to merge 3 commits into openai:main from
Conversation
Stacks XSA (PR openai#265), EMA weight averaging (PR openai#287), Int5-MLP (PR openai#264), MuonWD=0.04 tuned from PR openai#162, seq_len=2048, 11 layers, BigramHash(2048), SmearGate, OrthoInit (PR openai#135), Late-K FP16 on final layer. Single-seed result (seed=1337), ~8903 steps on 8xH100.
Author
Closing this PR: the artifact size was incorrect. The model artifact is 16,564,943 bytes, which exceeds the 16,000,000-byte (decimal) limit; the bytes_total field in submission.json was stale. A corrected submission will be opened when ready.
11L + XSA4 + EMA(0.997) + seq2048 + Int5-MLP + MuonWD=0.04 + LateK-FP16
val_bpb: 1.1361 (avg over seeds 1337/42/123, sliding window stride=64, post int5/int6+zstd-22 quantization roundtrip)
Hardware: 8×H100-80GB-SXM5 | Steps: ~9126 avg | Wallclock: 600s | Artifact: ~15.79MB
Approach
This submission stacks eight techniques on the baseline, building directly on PRs #135, #162, #264, #265, and #287. All changes are incremental over those prior submissions — no new external dependencies are required.
1. Per-Row Int6 Quantization + Int5-MLP + zstd-22 (PR #264, extended)
Attention weight matrices are quantized to int6 (symmetric, 63 levels, per-row scaling). MLP weights (c_fc, c_proj) use int5 (32 levels, [-16, 15]) instead of int6, which compresses ~15% better at the cost of a small precision penalty. This saves approximately 1.9MB relative to uniform int6, which directly funds the 11th transformer layer while staying within the 16MB artifact limit. Straight-through estimator (STE) fake-quantization is applied during training for both int6 and int5. The tied token embedding is kept in fp16. Weights are compressed with zstandard at level 22.

2. Late-K FP16 on Final Layer
The key projection (c_k.weight) of the last transformer block is kept in fp16 rather than int6. This avoids quantization noise in the most context-sensitive attention keys, at a cost of ~131KB, a favorable quality/size trade-off.

3. 11-Layer U-Net Architecture
The model uses 11 transformer blocks (5 encoder + 6 decoder) with U-Net skip connections, up from the 9-layer baseline. The 11th layer becomes feasible due to the byte savings from int5-MLP quantization.
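The per-row quantization from steps 1 and 2 amounts to a simple round trip per weight row. A minimal NumPy sketch (function name and shapes are illustrative, not the submission's actual code); during training the PR applies this as an STE, i.e. the forward pass uses the quantized weights while gradients flow to the full-precision master copy:

```python
import numpy as np

def fake_quant_per_row(w, bits):
    """Symmetric per-row fake quantization: int6 uses 63 levels in
    [-31, 31]; int5 uses 32 levels in [-16, 15] (as stated in the PR)."""
    if bits == 6:
        qmin, qmax = -31, 31
    else:  # int5, used for the c_fc / c_proj MLP weights
        qmin, qmax = -16, 15
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                       # guard all-zero rows
    q = np.clip(np.round(w / scale), qmin, qmax)  # integer codes (stored in the artifact)
    return q * scale                              # dequantized weights used at inference

# Late-K (step 2): the last block's c_k.weight simply skips this
# function and stays in fp16.
```

The integer codes plus one fp scale per row are what get zstd-compressed; the dequantized product is what the evaluated model actually runs with.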
4. Exclusive Self-Attention (XSA) on Last 4 Layers (PR #265)
XSA projects each token's own value vector out of the attention output before the output projection, preventing the model from trivially attending to its own value. It is applied to the deepest 4 layers (xsa_last_n=4) and implemented as an efficient GQA-aware subtraction with no repeat_interleave. Performance was confirmed neutral vs. 3 layers at seq_len=2048 with EMA (three independent runs). The implementation closely follows PR #265.

5. EMA Weight Averaging, decay=0.997 (PR #287)
An exponential moving average of all model parameters is maintained throughout training: ema = 0.997 * ema + 0.003 * param. The EMA weights are substituted into the model before final quantization and evaluation. This replaces Stochastic Weight Averaging (SWA), which was found to exceed the 16MB artifact budget when combined with 11 layers at seq_len=2048. EMA adds zero artifact size (the shadow copy is discarded after training) and yields smoother, better-regularized weights for quantization. Based on PR #287, which uses the same decay of 0.997.

6. Sequence Length 2048
Training sequence length is extended from 1024 to 2048. With the batch token budget held fixed, each step sees fewer but longer sequences, giving each update richer long-range context. The warmdown schedule is tuned to 2000 steps (rather than the 3500 default) to match the step budget at seq2048 training speed.
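The EMA update from step 5 is a one-line shadow update per training step. A minimal framework-agnostic sketch (class name is illustrative; real code would iterate over parameter tensors rather than scalars):

```python
class EMAShadow:
    """Exponential moving average of model parameters (decay=0.997)."""

    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = list(params)  # shadow copy; discarded after training

    def update(self, params):
        d = self.decay
        for i, p in enumerate(params):
            # ema = d * ema + (1 - d) * param
            self.shadow[i] = d * self.shadow[i] + (1.0 - d) * p

    def copy_into(self, params):
        # Substitute the EMA weights into the model before final
        # quantization and evaluation, as described above.
        for i in range(len(params)):
            params[i] = self.shadow[i]
        return params
```

Because the shadow copy lives only in training memory and the model's own weights are overwritten at the end, the technique costs nothing in artifact bytes.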
7. SmearGate + BigramHash(2048) + OrthoInit (PR #135)
Three techniques from PR #135 are included unchanged:
- BigramHash: bigram bucket indices are computed as xor(36313*t[i], 27191*t[i-1]) % 2047. The bucket count is reduced to 2048 (from the PR #135 default of 4096) to fit the artifact budget with 11 layers, and the embedding dimension is 64 rather than 128, maintaining the ~0.5MB fp16 footprint.
- SmearGate: included as in PR #135.
- OrthoInit: weight matrices are initialized with nn.init.orthogonal_(gain=1.0); output projections are additionally scaled by 1/sqrt(2*num_layers) following muP conventions.

8. Muon Optimizer with Weight Decay = 0.04 (PR #162, tuned)
PR #162 introduced decoupled weight decay for Muon. We tune the decay from the PR #162 default: a sweep over {0.01, 0.02, 0.04, 0.1} shows 0.04 as optimal (0.0006 BPB better than 0.02; 0.1 is catastrophic). Muon momentum is fixed at 0.95 after a warmup from 0.85 over 500 steps. RoPE base is 500K (confirmed −0.0036 BPB vs. the default 10K).
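A sketch of the two tuned optimizer pieces: decoupled weight decay applied multiplicatively outside the Muon gradient step, and the momentum warmup. Function names are illustrative, and linear interpolation over the warmup is an assumption; the PR states only 0.85 → 0.95 over 500 steps:

```python
def muon_momentum(step, warmup_steps=500, start=0.85, end=0.95):
    # Momentum warmup: ramp from 0.85 toward 0.95, then hold at 0.95.
    # (Linear ramp is an assumed schedule, not confirmed by the PR.)
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

def decoupled_weight_decay(param, lr, wd=0.04):
    # Decoupled decay (PR #162): shrink the weight directly, independent
    # of the orthogonalized Muon update applied to the gradient.
    return param * (1.0 - lr * wd)
```

Because the decay is decoupled, its strength scales with the learning rate but not with gradient magnitude, which is why a single swept constant (0.04) transfers cleanly across the run.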
Hyperparameters
Results
Average val_bpb: 1.1361 (3 seeds)
Technique Attribution