openai · raahilshah · Mar 24, 2026 · Mar 26, 2026
diff --git a/records/track_10min_16mb/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171/README.md b/records/track_10min_16mb/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171/README.md
@@ -0,0 +1,89 @@
+# Record: 11L XSA-all + Full GPTQ + Parallel Muon + Selective Pruning
+
+**val_bpb: 1.1171** (3-seed mean, std 0.0006) | **15.92 MB** max artifact | 8xH100 SXM, 600s
+
+## Results (3 seeds, 8xH100 SXM)
+
+| Seed | Steps | ms/step | Sliding BPB (s64) | val_loss | Artifact |
+|------|-------|---------|--------------------|----------|----------|
+| 1337 | ~7,100 | 84.2 | **1.1164** | 1.8851 | 15,920,050 bytes |
+| 42 | ~7,100 | 84.2 | 1.1171 | 1.8861 | 15,921,954 bytes |
+| 7 | ~7,100 | 84.2 | 1.1177 | 1.8871 | 15,914,654 bytes |
+
+**Mean: 1.1171 | Std: 0.0006**
+
+## Key Techniques
+
+### XSA on All 11 Layers
+Standard practice applies Exclusive Self-Attention to only the last 4 layers. Applying to all 11 forces cross-position information mixing from layer 0, improving representation quality. Zero new parameters — just a config change. -0.0016 BPB vs XSA-last-4 in ablation.
+
+### Full Hessian GPTQ with amax-aligned QAT
+- 256-sample calibration from training data for per-layer Hessian approximation
+- Column-wise int6 quantization with Cholesky error compensation, block size 128
+- QAT STE aligned to export quantizer using row-maximum (amax) clipping with [-32, 31] range
+- Late QAT at threshold 0.15
+
+### Parallel Muon Optimizer with Parameter Banking
+- Weight matrices stored in contiguous parameter banks (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank)
+- 3-phase overlapped optimizer step: async reduce-scatter → batched Newton-Schulz orthogonalization → async all-gather
+- Eliminates DDP double-communication overhead, achieving 84.2ms/step (~7,100 steps in 600s)
+
+### Selective ±1 Magnitude Pruning
+Post-GPTQ, sort quantized values at ±1 by reconstruction error (scale²), zero least-impactful first until artifact fits target. Binary search for exact target size. Targets only values whose removal causes minimal reconstruction damage.
+
+### LZMA Compression
+LZMA preset 6 replacing zstd-22 for model serialization. Better compression ratio on int6 quantized weights.
+
+## Architecture
+
+- 11 transformer layers, dim=512, 8 heads, 4 KV heads (GQA)
+- 3x MLP expansion (hidden=1536) with **LeakyReLU(0.5)²** activation
+- **XSA on all 11 layers** (Exclusive Self-Attention)
+- Partial RoPE (16/64 dims) + NTK-aware scaling
+- LN Scale Factor 1/sqrt(layer_idx+1)
+- U-Net skip connections (5 encoder, 6 decoder)
+- SmearGate temporal gating
+- BigramHash (2048 buckets, 128-dim)
+- Shared Value Embedding (dim=128, layers 9-10)
+- FlashAttention 3 (Hopper native kernels)
+- Orthogonal init, logit softcap 30, tied embeddings
+
+## Training
+
+- Parallel Muon optimizer (matrices): lr=0.025, momentum=0.99, WD=0.04, 5 Newton-Schulz steps
+- AdamW (embeddings): lr=0.035, (scalars): lr=0.025, WD=0.04
+- Gradient clip: 0.3
+- Batch: 786,432 tokens/step, seq_len=2048
+- Warmdown: 3,500 iters (wallclock-based)
+- EMA (decay=0.997) + Tight SWA (every 50 steps, scale<0.2)
+- Late QAT: STE int6 fake-quantization when LR scale<0.15
+
+## Quantization & Compression
+
+- Full GPTQ with 256-sample Hessian calibration, block_size=128, percdamp=0.01
+- Int6 per-row with amax clipping, range [-32, 31]
+- Selective ±1 magnitude pruning (target 15.9MB)
+- Small tensors + tok_emb.weight in fp16
+- LZMA preset 6 compression
+
+## Requirements
+
+```bash
+pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
+pip install zstandard sentencepiece
+```
+
+## Run Command
+
+```bash
+SEED=1337 TARGET_MB=15.9 torchrun --standalone --nproc_per_node=8 train_gpt.py
+```
+
+## Test Plan
+
+- [x] 3 seeds run on 8xH100 SXM
+- [x] All 3 seeds train in ≤600s
+- [x] All 3 seeds artifact ≤16,000,000 bytes (max: 15,921,954)
+- [x] Sliding window eval stride=64, consistent (std=0.0006)
+- [x] No test-time training on validation data
+- [x] No network calls during evaluation
diff --git a/records/track_10min_16mb/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171/submission.json b/records/track_10min_16mb/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171/submission.json
@@ -0,0 +1,19 @@
+{
+  "author": "Raahil Shah",
+  "github_id": "raahilshah",
+  "name": "11L XSA-all + Full GPTQ + Parallel Muon + LZMA + Selective Pruning",
+  "blurb": "XSA on all 11 layers, Hessian-aware GPTQ with amax-aligned QAT, Parallel Muon optimizer with parameter banking, LZMA compression, selective ±1 magnitude pruning. LeakyReLU(0.5)² activation, EMA(0.997), Tight SWA, VE128, Partial RoPE 16/64, LN Scale, BigramHash(2048), U-Net skips.",
+  "date": "2026-03-24T00:00:00Z",
+  "val_loss": 1.88609770,
+  "val_bpb": 1.11705625,
+  "pre_quant_val_loss": 1.9210,
+  "pre_quant_val_bpb": 1.1386,
+  "bytes_total": 15921954,
+  "seeds": {
+    "1337": {"val_bpb": 1.11643730, "val_loss": 1.88505263, "bytes_total": 15920050},
+    "42": {"val_bpb": 1.11708034, "val_loss": 1.88613838, "bytes_total": 15921954},
+    "7": {"val_bpb": 1.11765111, "val_loss": 1.88710208, "bytes_total": 15914654}
+  },
+  "mean_val_bpb": 1.11705625,
+  "std_val_bpb": 0.00060726
+}
diff --git a/records/track_10min_16mb/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171/train.log b/records/track_10min_16mb/2026-03-24_11L_XSA-all_GPTQ_ParallelMuon_1.1171/train.log
@@ -0,0 +1,87 @@
+logs/autorun_1774351988_exp93.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:80
+val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
+model_params:26993756
+mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
+XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
+train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
+seed:1337
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/20000 val_loss:6.9309 val_bpb:4.1049 train_time:0ms step_avg:0.02ms
+step:1/20000 train_loss:6.9317 train_time:508ms step_avg:507.92ms
+step:2/20000 train_loss:8.6935 train_time:551ms step_avg:275.26ms
+step:3/20000 train_loss:7.5958 train_time:634ms step_avg:211.21ms
+step:4/20000 train_loss:7.3349 train_time:719ms step_avg:179.72ms
+step:5/20000 train_loss:7.2640 train_time:802ms step_avg:160.47ms
+step:6/20000 train_loss:7.1177 train_time:888ms step_avg:148.02ms
+step:7/20000 train_loss:6.9213 train_time:973ms step_avg:138.95ms
+step:8/20000 train_loss:6.8020 train_time:1055ms step_avg:131.88ms
+step:9/20000 train_loss:6.4159 train_time:1140ms step_avg:126.64ms
+step:10/20000 train_loss:6.0442 train_time:1225ms step_avg:122.47ms
+step:500/20000 train_loss:2.3884 train_time:43397ms step_avg:86.79ms
+step:1000/20000 train_loss:2.2588 train_time:86679ms step_avg:86.68ms
+step:1500/20000 train_loss:2.2048 train_time:130091ms step_avg:86.73ms
+step:2000/20000 train_loss:2.0491 train_time:173631ms step_avg:86.82ms
+step:2500/20000 train_loss:2.1520 train_time:217233ms step_avg:86.89ms
+step:3000/20000 train_loss:2.1463 train_time:260825ms step_avg:86.94ms
+step:3500/20000 train_loss:2.1626 train_time:304413ms step_avg:86.98ms
+step:4000/20000 train_loss:1.9592 train_time:348016ms step_avg:87.00ms
+step:4000/20000 val_loss:2.0459 val_bpb:1.2117 train_time:348059ms step_avg:87.01ms
+step:4500/20000 train_loss:2.1078 train_time:391582ms step_avg:87.02ms
+step:5000/20000 train_loss:2.0857 train_time:435127ms step_avg:87.03ms
+step:5500/20000 train_loss:1.9993 train_time:478709ms step_avg:87.04ms
+step:6000/20000 train_loss:1.9235 train_time:522228ms step_avg:87.04ms
+swa:start step:6200
+late_qat:enabled step:6365 scale:0.1498
+step:6500/20000 train_loss:2.0609 train_time:566145ms step_avg:87.10ms
+step:6884/20000 val_loss:1.9207 val_bpb:1.1375 train_time:600081ms step_avg:87.17ms
+stopping_early: wallclock_cap train_time:600081ms step:6884/20000
+peak memory allocated: 22851 MiB reserved: 23004 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:1.9190 val_bpb:1.1365 eval_time:2083ms
+Serialized model: 106158518 bytes
+Code size: 87330 bytes
+gptq:building calibration model...
+gptq:calibrating with 256 full training batches...
+gptq:calibrated 68 layers in 46.5s
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+gptq_quantize: 66 GPTQ layers, 0 naive layers
+selective_prune: 4098820 ±1 candidates, unpruned=15.18MB target=15.9MB
+Serialized model int6+lzma: 15832720 bytes
+Total submission size int6+lzma: 15920050 bytes
+Total submission size int8+zlib: 15920050 bytes
+final_int6_roundtrip val_loss:1.9247 val_bpb:1.1399 eval_time:14661ms
+final_int6_roundtrip_exact val_loss:1.92469650 val_bpb:1.13991368
+final_int6_sliding_window val_loss:1.8851 val_bpb:1.1164 stride:64 eval_time:85035ms
+final_int6_sliding_window_exact val_loss:1.88505263 val_bpb:1.11643730
+final_int8_zlib_roundtrip_exact val_loss:1.88505263 val_bpb:1.11643730