openai · sabdulmajid · Apr 7, 2026
diff --git a/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/README.md b/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/README.md
@@ -0,0 +1,84 @@
+# Non-Record Submission: 1.20289664 BPB — Mixed-Int6 LZMA9 B3072 Warm5000
+
+**EMA + XSA(last-4) + BigramHash3072 + warmdown5000 + LeakyReLU^2 + mixed-int6 export**
+
+**val_bpb: 1.20289664** (sliding, seed=42) | **15,991,188 bytes** artifact | single-GPU unlimited-compute run (~16.1h)
+
+> **This is a non-record unlimited-compute submission.** Training ran for about 16.1 hours on a single GPU, so this is not a 10-minute leaderboard attempt. The main result is a stronger legal artifact than the listed 4-hour non-record baseline, using a known flat-transformer recipe plus longer single-GPU training and a broad mixed-int6/LZMA9 export path.
+
+## Result
+
+| Metric | Value |
+|--------|-------|
+| Sliding BPB | **1.20289664** |
+| Sliding val_loss | **1.99963255** |
+| Pre-quant sliding BPB | **1.16618894** |
+| Pre-quant sliding val_loss | **1.93861159** |
+| Steps | **16,000** |
+| Training time | **57,979.039s** (~16.1h) |
+| Artifact | **15,991,188 bytes** |
+| Code bytes | **110,016** |
+| Compressed model bytes | **15,881,172** |
+| Parameters | **27,124,828** |
+
+## Positioning
+
+This is **not** claiming a new SOTA technique. EMA, XSA, BigramHash, LeakyReLU^2, int6-style quantization, LZMA compression, and sliding evaluation are all established in prior submissions.
+
+The useful contribution is a non-record data point showing that the flat EMA/XSA/BigramHash family still improves under longer single-GPU training, and that a broad `mlp;attn;embed` mixed-int6 export with LZMA9 can keep the resulting 27.1M-parameter checkpoint legal under the 16MB artifact cap.
+
+| Reference | BPB | Notes |
+|-----------|----:|-------|
+| This submission | **1.20289664** | 16.1h single-GPU, non-record, 15,991,188 bytes |
+| 4-hour non-record baseline | 1.20737944 | Listed unlimited-compute baseline |
+| 1-bit non-record | 1.1239 | Stronger non-record result; this submission does not beat it |
+| Current 10-minute record | 1.11473509 | Stronger 10-minute SOTA; this submission is not a record attempt |
+
+## What Changed
+
+- **Longer training on the flat stack:** `BIGRAM_VOCAB_SIZE=3072`, XSA on the last 4 layers, EMA, LeakyReLU^2, and `WARMDOWN_ITERS=5000` produced a raw sliding score of **1.16618894 BPB** before legal export.
+- **Broad mixed-int6 export:** `QUANT_INT6_CATS=mlp;attn;embed`, `INT8_KEEP_FLOAT_MAX_NUMEL=32768`, and LZMA9 extreme compression produced a legal artifact at **15,991,188 bytes**.
+- **Export separation:** the raw checkpoint was preserved, then re-exported and evaluated independently. The export found **3,894,003** candidate `+-1` int6 entries, but the compressed artifact already fit the target byte cap, so no entries had to be pruned.
+
+## Training Configuration
+
+- 8 FineWeb SP1024 training shards
+- EMA enabled, decay `0.997`
+- XSA active on the last 4 layers
+- BigramHash `3072 x 128`
+- `leaky_relu2` activation with slope `0.5`
+- `TRAIN_BATCH_TOKENS=262144`
+- `TRAIN_SEQ_LEN=2048`
+- `ITERATIONS=16000`
+- `WARMDOWN_ITERS=5000`
+- `MAX_WALLCLOCK_SECONDS=64800`
+
+The exact training log is included in [train.log](./train.log).
+
+## Export Configuration
+
+- Eval-only export from the preserved raw checkpoint
+- `QUANT_SCHEME=mixed_int6`
+- `QUANT_INT6_CATS=mlp;attn;embed`
+- `INT8_KEEP_FLOAT_MAX_NUMEL=32768`
+- `QUANT_SELECTIVE_PRUNE=1`
+- `QUANT_TARGET_TOTAL_BYTES=16000000`
+- `QUANT_LZMA_PRESET=9`
+- `QUANT_LZMA_EXTREME=1`
+
+The exact export/eval log is included in [export_eval.log](./export_eval.log).
+
+## Files
+
+- `train_gpt.py`: exact script for this submission branch
+- `requirements.txt`: environment reference
+- `train.log`: training log
+- `export_eval.log`: export and final eval log
+
+## Compliance
+
+- [x] Artifact <= 16,000,000 bytes
+- [x] No test-time training on validation data
+- [x] No network calls during evaluation
+- [x] Self-contained script included
+- [x] Non-record unlimited-compute track
diff --git a/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/export_eval.log b/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/export_eval.log
@@ -0,0 +1,46 @@
+Starting run at 2026-04-07 07:12:18 UTC
+Run: exp8855_mi6_mae_k32_pr_lz9e_9336
+NPROC: 1
+MLP activation: leaky_relu2
+MLP leaky slope: 0.5
+MLP prelu init: 0.25
+MLP softplus beta: 1.0
+TTT optimizer: sgd
+TTT epoch schedule: fixed
+TTT chunk schedule: fixed
+TTT batch seqs: 32
+logs/exp8855_mi6_mae_k32_pr_lz9e_9336.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:8
+val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:131072 limited=1
+attention_backend:requested=sdpa flash_attn_available=0 compile_enabled=0 cuda_dtype=float16
+eval_only_checkpoint:<raw_checkpoint>/long16h_ema_b3072_warm5000_8855.final_model_raw.pt
+averaging_config ema_enabled=0 ema_decay=0.997 swa_enabled=0 swa_every=50 lawa_enabled=0 save_raw_checkpoint=0
+activation_config mlp_activation=leaky_relu2 mlp_leaky_slope=0.5 mlp_prelu_init=0.25 mlp_softplus_beta=1.0
+quant_config scheme=mixed_int6 int6_cats=mlp;attn;embed keep_float_max_numel=32768 keep_float16=none force_float16=none lzma_preset=9 extreme=1 selective_prune=1 target_total_bytes=16000000
+model_params:27124828
+mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
+XSA:last_4 active_layers:[7, 8, 9, 10]
+world_size:1 grad_accum_steps:8
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
+train_batch_tokens:262144 train_seq_len:2048 iterations:300 warmup_steps:0 max_wallclock_seconds:1800.000
+seed:42
+step:300/300 val_loss:2.0278 val_bpb:1.2163 train_time:0ms step_avg:0.00ms
+peak memory allocated: 170 MiB reserved: 192 MiB
+DIAGNOSTIC pre_avg val_loss:2.0278 val_bpb:1.2163 eval_time:1912ms
+averaging:using raw training weights
+DIAGNOSTIC post_ema val_loss:2.0278 val_bpb:1.2163 eval_time:1914ms
+Serialized model: 106420662 bytes
+Code size: 110016 bytes
+selective_prune: 3894003 ±1 candidates unpruned=15991188 target=16000000
+selective_prune: already fits target
+Serialized model int6+lzma: 15881172 bytes
+Total submission size int6+lzma: 15991188 bytes
+final_int6_roundtrip val_loss:2.0361 val_bpb:1.2213 eval_time:1923ms
+final_int6_roundtrip_exact val_loss:2.03610863 val_bpb:1.22128751
+final_int6_sliding_window val_loss:1.9996 val_bpb:1.2029 stride:64 eval_time:52209ms
+final_int6_sliding_window_exact val_loss:1.99963255 val_bpb:1.20289664
+final_int8_zlib_roundtrip_exact val_loss:1.99963255 val_bpb:1.20289664
+Finished run at 2026-04-07 07:14:15 UTC
diff --git a/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/requirements.txt b/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/requirements.txt
@@ -0,0 +1,10 @@
+numpy
+tqdm
+torch
+huggingface-hub
+kernels
+setuptools
+typing-extensions==4.15.0
+datasets
+tiktoken
+sentencepiece
diff --git a/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/submission.json b/records/track_non_record_16mb/2026-04-07_MixedInt6_LZMA9_B3072_Warm5000/submission.json
@@ -0,0 +1,22 @@
+{
+  "author": "Ayman Abdul-Majid",
+  "github_id": "sabdulmajid",
+  "name": "Mixed-Int6 LZMA9 B3072 Warm5000",
+  "blurb": "Unlimited-compute non-record submission: a 16k-step EMA + XSA(last-4) + BigramHash3072 SP1024 run with longer warmdown, exported legally with mixed-int6 over mlp/attn/embed plus LZMA9 extreme compression. Pre-quant sliding BPB 1.1662, final legal artifact 1.20289664 at 15,991,188 bytes.",
+  "date": "2026-04-07T07:14:15Z",
+  "track": "non-record-unlimited-compute-16mb",
+  "val_loss": 1.99963255,
+  "val_bpb": 1.20289664,
+  "pre_quant_val_loss": 1.93861159,
+  "pre_quant_val_bpb": 1.16618894,
+  "step_stop": 16000,
+  "wallclock_seconds": 57979.039,
+  "bytes_total": 15991188,
+  "bytes_model_int6_lzma": 15881172,
+  "bytes_code": 110016,
+  "hardware": "single-GPU training and single-GPU export/eval",
+  "quant_scheme": "mixed_int6",
+  "quant_int6_cats": "mlp;attn;embed",
+  "quant_lzma_preset": 9,
+  "quant_lzma_extreme": true
+}