@@ -0,0 +1,136 @@
# AR Self-Gen GPTQ + XSA-all + Per-Layer Hadamard ROTQ

**Non-record submission.**

This run uses the same evaluation metric and artifact accounting as the main leaderboard, but it is **not** a main-track record submission because it was trained on **1xH100 for 4800 seconds** rather than reproduced under the official **8xH100 SXM / 600-second** budget.

## Summary

This submission starts from the current public `11L AR Self-Gen GPTQ + XSA` stack and adds a small, modular quantization-basis change:

- keep the runtime model as standard dense linear layers after load
- choose a compact right-rotation for the large MLP matrices before GPTQ
- quantize in the rotated basis
- invert the rotation after dequantization during roundtrip evaluation
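
The steps above can be sketched as a rotate-quantize-unrotate roundtrip. This is a minimal illustration, not the submission's GPTQ path: it substitutes a naive symmetric per-tensor rounder for GPTQ, and assumes the rotation `R` is orthogonal so its inverse is just the transpose.

```python
import numpy as np

def quantize_symmetric(W, bits=6):
    """Naive symmetric per-tensor quantizer (a stand-in for GPTQ in this sketch)."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6
    scale = np.abs(W).max() / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q, scale

def rotated_roundtrip(W, R):
    """Quantize in the rotated basis, then invert the rotation after dequant.

    R is assumed orthogonal (R @ R.T == I), so the inverse is R.T.
    """
    q, scale = quantize_symmetric(W @ R)  # rotate right, then quantize
    return (q * scale) @ R.T              # dequantize, then undo the rotation
```

Because the rotation is orthogonal, the quantization error introduced in the rotated basis has the same Frobenius norm after the inverse rotation; the gain comes from the rotated basis spreading outliers so the quantization grid is used more efficiently.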

The best variant here is simple:

- Hadamard right-rotation
- applied to `mlp_up` and `mlp_down`
- block size chosen per layer from `{128, 256, 512}`
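
A blockwise Walsh-Hadamard right-rotation of the kind described can be sketched as follows (a simple Sylvester-construction reference, assuming power-of-two block sizes; the run's actual implementation is in `train_gpt.py`):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh-Hadamard matrix, scaled to be orthogonal."""
    assert n > 0 and (n & (n - 1)) == 0, "block size must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def hadamard_right_rotate(W, block):
    """Right-rotate the columns of W in independent blocks of size `block`."""
    rows, cols = W.shape
    assert cols % block == 0, "columns must divide evenly into blocks"
    H = hadamard(block)
    return (W.reshape(rows, cols // block, block) @ H).reshape(rows, cols)
```

The normalized Sylvester matrix is symmetric and orthogonal, so the same call inverts the rotation, and the only metadata needed per tensor is the chosen block size.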

On this longer single-GPU run, that export path slightly improves the exact final sliding score over the unrotated export on the same checkpoint while staying under the 16,000,000-byte cap.

## Result

Reference values from the current upstream README at submission time:

- Main leaderboard target: `1.1147`
- Naive baseline: `1.2244`

Best result in this folder:

- Exact sliding `val_bpb`: `1.11290586`
- Exact sliding `val_loss`: `1.87908996`
- Total artifact bytes: `15,826,148`
- Hardware / duration: `1xH100`, `4800s`, seed `314`
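
As a sanity check, the reported `val_loss` (nats per token) and `val_bpb` are consistent under the usual conversion `bpb = (loss / ln 2) * tokens_per_byte`, which lets the implied bytes-per-token ratio of the validation set be recovered (assuming that standard definition; the exact accounting lives in the eval code):

```python
import math

val_loss = 1.87908996                        # mean cross-entropy, nats per token
val_bpb = 1.11290586                         # bits per byte

bits_per_token = val_loss / math.log(2)      # ~2.711 bits per token
bytes_per_token = bits_per_token / val_bpb   # implied dataset ratio
print(f"{bytes_per_token:.3f}")              # ~2.436 bytes per token
```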

Because this is a non-record run, the correct interpretation is:

- same metric as the leaderboard
- numerically better than the current top README score
- not eligible for the main track until reproduced within the official budget

## Ablation Table

All rows below use the same trained checkpoint and differ only in the export path.

| Export path | Exact sliding val_bpb | Bytes total |
|---|---:|---:|
| Base AR self-gen GPTQ export | `1.11296252` | `15,869,601` |
| `mlp_down` Hadamard `256` | `1.11291713` | `15,825,000` |
| `mlp_down` per-layer Hadamard | `1.11290938` | `15,826,480` |
| `mlp_up + mlp_down` per-layer Hadamard | `1.11290586` | `15,826,148` |

Net gain from the best rotation-aware export over the base export on the same checkpoint:

- `-0.00005666` BPB

## What Changed

Relative to the public AR self-generated GPTQ + XSA-all baseline stack, this version adds:

1. Rotation-aware GPTQ hooks for selected 2D tensors
2. Blockwise Walsh-Hadamard right-rotations before GPTQ
3. Shared or per-layer rotation selection over semantic tensor groups
4. Roundtrip reconstruction by inverting the right-rotation after dequantization

The implementation remains modular and opt-in via environment flags. The best result here came from forcing a per-layer Hadamard search on `mlp_up` and `mlp_down`.
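
The per-layer selection in step 3 can be sketched as scoring each candidate block size by a quantization-error proxy and keeping the best. This is an illustrative guess at the scoring, not the exact metric behind the `score:` values in the logs (assumptions: blockwise Hadamard rotation, a naive symmetric int6 rounder as the proxy):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Walsh-Hadamard matrix, scaled to be orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_score(W, block, bits=6):
    """Proxy score: relative round-to-grid error after a blockwise right-rotation."""
    rows, cols = W.shape
    H = hadamard(block)
    rotated = (W.reshape(rows, cols // block, block) @ H).reshape(rows, cols)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(rotated).max() / qmax
    err = rotated - np.round(rotated / scale) * scale
    return np.linalg.norm(err) / np.linalg.norm(W)

def pick_block(W, candidates=(128, 256, 512)):
    """Choose the per-layer block size with the lowest proxy score."""
    return min(candidates, key=lambda b: quant_score(W, b))
```

With `ROTQ_GROUP_SHARE=0` this choice is made independently per tensor, which is what the per-layer `block:128/256/512` lines in the ablation log reflect.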

## Run Commands

Training run used for the checkpoint:

```bash
RUN_ID=ar_xsa_base_long4800_seed314 \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
BIGRAM_VOCAB_SIZE=3072 \
BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 \
SEED=314 \
MAX_WALLCLOCK_SECONDS=4800 \
ROTQ_ENABLED=0 \
EVAL_FORCE_S64=1 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

Best export-only ablation on the resulting checkpoint:

```bash
RUN_ID=ar_export_rotq_mlpupdown_perlayer_had_long4800_fullcal_slide64 \
INIT_MODEL_PATH=/workspace/parameter-golf/checkpoints/ar_xsa_base_long4800_seed314_final_model.pt \
DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
BIGRAM_VOCAB_SIZE=3072 \
BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 \
SEED=314 \
MAX_WALLCLOCK_SECONDS=0 \
EVAL_FORCE_S64=1 \
ROTQ_ENABLED=1 \
ROTQ_FORCE=1 \
ROTQ_MODE=hadamard \
ROTQ_TARGETS=mlp_down,mlp_up \
ROTQ_BLOCK_SIZES=128,256,512 \
ROTQ_GROUP_SHARE=0 \
ROTQ_SCORE_TENSORS=6 \
ROTQ_SCORE_ROWS=256 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

## Compliance Notes

- No validation leakage during training
- GPTQ calibration data is autoregressively self-generated by the trained model
- No network calls are required during evaluation
- Artifact remains under the decimal `16,000,000` byte cap
- This is explicitly a **non-record** submission because compute exceeds the main-track budget

## Included Files

- `train_gpt.py`: exact script used for the best export result in this folder
- `train.log`: recovered copy of the auto-generated log for the long training run
- `rotq_ablation.log`: recovered copy of the export-only ablation log
- `submission.json`: metadata for this non-record submission

## Limitations

- Single-seed evidence only
- Not reproduced on `8xH100 SXM` within the `600s` training budget
- The pod stopped before the raw log sync completed, so the attached logs are recovered copies of the exact auto-generated log content captured from the pod session
- The ROTQ gain is real but small; the main value of this PR is the new modular export idea and the evidence that it composes with the current strong AR-GPTQ/XSA stack
@@ -0,0 +1,3 @@
sentencepiece
zstandard
flash_attn_3
@@ -0,0 +1,93 @@
starting ar_export_rotq_mlpdown_had256_long4800_fullcal_slide64
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_kernel:flashattn3
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:0 warmup_steps:0 max_wallclock_seconds:0.000
seed:314
step:0/0 val_loss:1.9122 val_bpb:1.1325 train_time:0ms step_avg:0.02ms
peak memory allocated: 1002 MiB reserved: 1242 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9122 val_bpb:1.1325 eval_time:16429ms
Serialized model: 106289590 bytes
Code size: 119252 bytes
gptq:building non-banked model for Hessian collection...
gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
gptq:generated 64 sequences in 199.6s
gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 68 layers (AR self-gen)
rotq:enabled groups:1 active:11 meta_bytes:264
rotq:mlp_down:1536 group:mlp_down mode:hadamard block:256 score:1.564798e-01
selective_prune: 4217740 ±1 candidates, unpruned=15.09MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15705748 bytes
Rotation metadata bytes (logical): 264
Total submission size int6+lzma: 15825000 bytes
final_int6_roundtrip val_loss:1.9190 val_bpb:1.1365 eval_time:16283ms
final_int6_roundtrip_exact val_loss:1.91901076 val_bpb:1.13654627
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:595377ms
final_int6_sliding_window_exact val_loss:1.87910898 val_bpb:1.11291713
final_int8_zlib_roundtrip_exact val_loss:1.87910898 val_bpb:1.11291713
starting ar_export_rotq_mlpdown_perlayer_had_long4800_fullcal_slide64
DIAGNOSTIC post_ema val_loss:1.9122 val_bpb:1.1325 eval_time:16381ms
Serialized model: 106289590 bytes
Code size: 119252 bytes
gptq:building non-banked model for Hessian collection...
gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
gptq:generated 64 sequences in 198.9s
gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 68 layers (AR self-gen)
rotq:enabled groups:11 active:11 meta_bytes:264
rotq:blocks.0.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.569647e-01
rotq:blocks.1.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.561793e-01
rotq:blocks.2.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.568002e-01
rotq:blocks.3.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.558599e-01
rotq:blocks.4.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.558491e-01
rotq:blocks.5.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.564744e-01
rotq:blocks.6.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.560896e-01
rotq:blocks.7.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.559215e-01
rotq:blocks.8.mlp.proj.weight group:mlp_down mode:hadamard block:256 score:1.554852e-01
rotq:blocks.9.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.563086e-01
rotq:blocks.10.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.571935e-01
selective_prune: 4216026 ±1 candidates, unpruned=15.09MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15707228 bytes
Rotation metadata bytes (logical): 264
Total submission size int6+lzma: 15826480 bytes
final_int6_roundtrip val_loss:1.9190 val_bpb:1.1365 eval_time:16266ms
final_int6_roundtrip_exact val_loss:1.91899953 val_bpb:1.13653961
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:592103ms
final_int6_sliding_window_exact val_loss:1.87909590 val_bpb:1.11290938
final_int8_zlib_roundtrip_exact val_loss:1.87909590 val_bpb:1.11290938
starting ar_export_rotq_mlpupdown_perlayer_had_long4800_fullcal_slide64
rotq:blocks.2.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.543067e-01
rotq:blocks.3.mlp.fc.weight group:mlp_up mode:hadamard block:512 score:1.542510e-01
rotq:blocks.4.mlp.fc.weight group:mlp_up mode:hadamard block:512 score:1.528345e-01
rotq:blocks.5.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.565601e-01
rotq:blocks.6.mlp.fc.weight group:mlp_up mode:hadamard block:256 score:1.545500e-01
rotq:blocks.7.mlp.fc.weight group:mlp_up mode:hadamard block:256 score:1.552076e-01
rotq:blocks.8.mlp.fc.weight group:mlp_up mode:hadamard block:512 score:1.547106e-01
rotq:blocks.9.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.537057e-01
rotq:blocks.10.mlp.fc.weight group:mlp_up mode:hadamard block:128 score:1.545073e-01
rotq:blocks.0.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.569647e-01
rotq:blocks.1.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.561793e-01
rotq:blocks.2.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.568002e-01
rotq:blocks.3.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.558599e-01
rotq:blocks.4.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.558491e-01
rotq:blocks.5.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.564744e-01
rotq:blocks.6.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.560896e-01
rotq:blocks.7.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.559215e-01
rotq:blocks.8.mlp.proj.weight group:mlp_down mode:hadamard block:256 score:1.554852e-01
rotq:blocks.9.mlp.proj.weight group:mlp_down mode:hadamard block:512 score:1.563086e-01
rotq:blocks.10.mlp.proj.weight group:mlp_down mode:hadamard block:128 score:1.571935e-01
selective_prune: 4208001 ±1 candidates, unpruned=15.09MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15706896 bytes
Rotation metadata bytes (logical): 506
Total submission size int6+lzma: 15826148 bytes
final_int6_roundtrip val_loss:1.9190 val_bpb:1.1365 eval_time:16322ms
final_int6_roundtrip_exact val_loss:1.91899332 val_bpb:1.13653594
final_int6_sliding_window val_loss:1.8791 val_bpb:1.1129 stride:64 eval_time:596342ms
final_int6_sliding_window_exact val_loss:1.87908996 val_bpb:1.11290586
final_int8_zlib_roundtrip_exact val_loss:1.87908996 val_bpb:1.11290586
@@ -0,0 +1,12 @@
{
"track": "non_record_16mb",
"date": "2026-04-01",
"name": "AR Self-Gen GPTQ + XSA-all + Per-Layer Hadamard ROTQ",
"author": "Arash",
"github_id": "vermissa0ss",
"val_bpb": 1.11290586,
"val_loss": 1.87908996,
"bytes_total": 15826148,
"hardware": "1xH100 80GB",
"blurb": "Non-record submission. Starts from the public AR self-generated GPTQ + XSA-all + BigramHash stack and adds rotation-aware GPTQ with per-layer Hadamard right-rotations on MLP matrices. Best export on a 1xH100/4800s run reached 1.11290586 sliding bpb at 15,826,148 bytes."
}
@@ -0,0 +1,80 @@
model_params:27067484
mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
world_size:1 grad_accum_steps:8
sdp_backends:cudnn=False flash=True mem_efficient=False math=False
attention_kernel:flashattn3
attention_mode:gqa num_heads:8 num_kv_heads:4
tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:4800.000
seed:314
warmup_step:1/20
warmup_step:2/20
warmup_step:3/20
warmup_step:4/20
warmup_step:5/20
warmup_step:6/20
warmup_step:7/20
warmup_step:8/20
warmup_step:9/20
warmup_step:10/20
warmup_step:11/20
warmup_step:12/20
warmup_step:13/20
warmup_step:14/20
warmup_step:15/20
warmup_step:16/20
warmup_step:17/20
warmup_step:18/20
warmup_step:19/20
warmup_step:20/20
step:0/20000 val_loss:6.9271 val_bpb:4.1026 train_time:0ms step_avg:0.02ms
step:1/20000 train_loss:6.9279 train_time:652ms step_avg:652.22ms
step:2/20000 train_loss:8.6225 train_time:1243ms step_avg:621.68ms
step:3/20000 train_loss:7.6539 train_time:1904ms step_avg:634.55ms
step:4/20000 train_loss:7.3319 train_time:2548ms step_avg:636.98ms
step:5/20000 train_loss:7.0682 train_time:3200ms step_avg:639.99ms
step:6/20000 train_loss:7.0074 train_time:3847ms step_avg:641.21ms
step:7/20000 train_loss:6.9721 train_time:4493ms step_avg:641.83ms
step:8/20000 train_loss:6.7644 train_time:5146ms step_avg:643.20ms
step:9/20000 train_loss:6.4777 train_time:5793ms step_avg:643.68ms
step:10/20000 train_loss:6.1204 train_time:6441ms step_avg:644.09ms
step:500/20000 train_loss:2.3192 train_time:325737ms step_avg:651.47ms
step:1000/20000 train_loss:2.2598 train_time:651864ms step_avg:651.86ms
step:1500/20000 train_loss:2.1287 train_time:978338ms step_avg:652.23ms
step:2000/20000 train_loss:2.0470 train_time:1305273ms step_avg:652.64ms
step:2500/20000 train_loss:2.0920 train_time:1632135ms step_avg:652.85ms
step:3000/20000 train_loss:2.0749 train_time:1959198ms step_avg:653.07ms
step:3500/20000 train_loss:2.0674 train_time:2286291ms step_avg:653.23ms
step:4000/20000 train_loss:2.1350 train_time:2613343ms step_avg:653.34ms
step:4000/20000 val_loss:2.0439 val_bpb:1.2105 train_time:2613392ms step_avg:653.35ms
step:4500/20000 train_loss:2.1269 train_time:2940377ms step_avg:653.42ms
step:5000/20000 train_loss:2.0362 train_time:3267311ms step_avg:653.46ms
step:5500/20000 train_loss:2.0406 train_time:3594311ms step_avg:653.51ms
step:6000/20000 train_loss:1.9402 train_time:3921190ms step_avg:653.53ms
step:6500/20000 train_loss:2.0445 train_time:4247900ms step_avg:653.52ms
swa:start step:6550
late_qat:enabled step:6744 scale:0.1498
step:7000/20000 train_loss:1.8408 train_time:4576477ms step_avg:653.78ms
step:7341/20000 val_loss:1.9137 val_bpb:1.1334 train_time:4800390ms step_avg:653.91ms
stopping_early: wallclock_cap train_time:4800390ms step:7341/20000
peak memory allocated: 23002 MiB reserved: 23184 MiB
ema:applying EMA weights
DIAGNOSTIC post_ema val_loss:1.9121 val_bpb:1.1325 eval_time:16505ms
Serialized model: 106289590 bytes
Code size: 118057 bytes
gptq:building non-banked model for Hessian collection...
gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
gptq:generated 64 sequences in 204.7s
gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 68 layers (AR self-gen)
selective_prune: 4217763 ±1 candidates, unpruned=15.13MB target=15.9MB
selective_prune: already fits, no pruning needed
Serialized model int6+lzma: 15751544 bytes
Rotation metadata bytes (logical): 0
Total submission size int6+lzma: 15869601 bytes
final_int6_roundtrip val_loss:1.9191 val_bpb:1.1366 eval_time:20641ms
final_int6_roundtrip_exact val_loss:1.91911329 val_bpb:1.13660699
final_int6_sliding_window val_loss:1.8792 val_bpb:1.1130 stride:64 eval_time:591478ms
final_int6_sliding_window_exact val_loss:1.87918563 val_bpb:1.11296252
final_int8_zlib_roundtrip_exact val_loss:1.87918563 val_bpb:1.11296252