# Approach G: AR Self-Generated GPTQ Calibration

**Target: val_bpb < 1.12** | 8xH100 SXM, 600s | < 16MB artifact

## Technique

After training completes, the model autoregressively generates its own calibration data (64 sequences x 2048 tokens, temperature=0.8, fixed seed). These self-generated sequences are used to collect Hessians H = X^T X for full GPTQ quantization with Cholesky error compensation and column reordering.
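The generation step itself is ordinary temperature sampling. A minimal numpy sketch of one sampling step (illustrative only; `sample_next` and its signature are assumptions, not the repo's actual `generate_autoregressive_calib()`):

```python
import numpy as np

def sample_next(logits, temperature=0.8, rng=None):
    """Draw one token from temperature-scaled logits.

    Illustrates the AR self-generation step: temperature=0.8 plus a
    fixed seed makes the calibration set reproducible across runs.
    """
    if rng is None:
        rng = np.random.default_rng(1337)  # fixed seed for reproducibility
    z = logits / temperature
    z -= z.max()                # stabilize the softmax
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

During calibration the model would be rolled out this way 64 times for 2048 steps each, feeding every sampled token back in as the next input.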

**Why this is legal:** No training data and no validation data are accessed during quantization. The calibration data is entirely self-generated by the model. This approach was confirmed legal by the competition organizer on PR #1019.

**Why this is better than training-data calibration:** The self-generated text better represents the model's actual activation distribution at inference time, leading to lower quantization error. PR #1019 achieved 1.1147 BPB with this technique (current SOTA).
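Hessian collection is then a matter of re-running the model on its own generated tokens and accumulating H = X^T X from each quantized layer's inputs. A minimal numpy sketch (the function name is illustrative, not the repo's `collect_hessians_from_tokens()`):

```python
import numpy as np

def accumulate_hessian(activation_batches, dim):
    """Accumulate H = X^T X over batches of layer inputs.

    activation_batches: iterable of (n_tokens, dim) arrays captured
    from a layer's input during forward passes on self-generated text.
    """
    H = np.zeros((dim, dim))
    n = 0
    for X in activation_batches:
        X = X.astype(np.float64)   # accumulate in fp64 for stability
        H += X.T @ X
        n += X.shape[0]
    return H / max(n, 1)           # mean over calibration tokens
```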

## Base

Built on Approach B (Int5 GPTQ + 33.6M params + TTT), replacing the training-data GPTQ calibration with AR self-generated calibration.

## Key Changes from Approach B

1. **AR Self-Gen Calibration**: `generate_autoregressive_calib()` generates 64 sequences of 2048 tokens at temp=0.8
2. **Hessian Collection from Generated Data**: `collect_hessians_from_tokens()` collects H = X^T X from self-generated sequences
3. **Training Time Budget**: `max_wallclock_seconds` reduced from 590s to 390s to reserve ~210s for AR generation + Hessian collection
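The collected Hessians feed the GPTQ solver. Its core loop quantizes one column at a time, divides the residual by the diagonal of the inverse-Hessian Cholesky factor, and pushes the error onto the not-yet-quantized columns. A simplified numpy sketch (per-row symmetric scales; omits the column reordering and grouping a production implementation would use):

```python
import numpy as np

def gptq_quantize(W, H, n_bits=6, damp=0.01):
    """Quantize W (rows, cols) with GPTQ error compensation.

    H is the (cols, cols) calibration Hessian X^T X for this layer.
    A sketch of the technique, not the repo's exact code.
    """
    cols = W.shape[1]
    # Dampen the Hessian so the factorization is well-posed.
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper Cholesky factor U of the inverse: inv(Hd) = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(Hd)).T
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    qmax = 2 ** (n_bits - 1) - 1           # 31 for int6
    scale = np.abs(W).max(axis=1) / qmax   # per-row symmetric scale
    scale[scale == 0] = 1.0
    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (w - q) / U[j, j]
        # Spread the quantization error over the remaining columns.
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```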

## Architecture

- 11 layers, dim=512, 8 heads, 8 KV heads
- BigramHash embedding (6144 x 128)
- Value embeddings on layers 9,10
- XSA on last 11 layers
- SmearGate, U-Net skip connections
- ReLU^2 MLP (3.5x width)
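The architecture above could be captured in a config object like the following (a hypothetical sketch; the actual run configures these via environment variables in `run.sh`, and the derived MLP width is an assumption from "3.5x width"):

```python
from dataclasses import dataclass

@dataclass
class ApproachGConfig:
    """Hypothetical config mirroring the architecture list above."""
    num_layers: int = 11
    model_dim: int = 512
    num_heads: int = 8
    num_kv_heads: int = 8
    vocab_size: int = 1024
    bigram_vocab_size: int = 6144
    bigram_dim: int = 128
    ve_layers: tuple = (9, 10)   # value embeddings on layers 9, 10
    xsa_last_n: int = 11
    mlp_mult: float = 3.5

    @property
    def mlp_hidden_dim(self) -> int:
        # assuming "3.5x width" means 3.5 * model_dim
        return int(self.model_dim * self.mlp_mult)
```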

## Quantization

- Int6 GPTQ for attention + MLP weights (Cholesky error compensation, column reordering)
- Int8 for remaining weights
- zstd level 22 compression
- 10% magnitude pruning pre-quantization
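The pre-quantization pruning step can be sketched as unstructured global magnitude pruning (an illustration; the repo may prune per-tensor or per-layer instead):

```python
import numpy as np

def magnitude_prune(W, pct=0.10):
    """Zero the smallest-|w| fraction of entries before quantization.

    Pruned zeros quantize exactly and compress well under zstd.
    """
    k = int(W.size * pct)
    if k == 0:
        return W.copy()
    # The k-th smallest magnitude is the pruning threshold.
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    out = W.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```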

## Eval

- Sliding window eval (stride=64)
- Score-first TTT (3 epochs, lr=0.0001, AdamW)
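With stride 64, sliding-window eval scores every token exactly once while giving each scored token up to 2048 tokens of left context: only the last 64 tokens of each window (after the first) contribute to the loss. The span bookkeeping can be sketched as follows (helper name is an assumption):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, score_from, score_to) triples.

    Each window sees tokens [ctx_start, ctx_start + seq_len), but only
    tokens in [score_from, score_to) are scored, so every token is
    evaluated exactly once.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + seq_len, n_tokens)
        if end <= prev_end:
            break
        spans.append((start, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```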

## Run

```bash
NCCL_IB_DISABLE=1 RUN_ID=approach_g SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee run.log
```
---

**`records/track_10min_16mb/2026-04-01_ApproachG_SelfGenGPTQ/run.sh`**
```bash
#!/bin/bash
# Approach G: AR Self-Gen GPTQ
# Run on 8xH100 SXM within 600s budget
set -euo pipefail

export NCCL_IB_DISABLE=1
export RUN_ID="${RUN_ID:-approach_g}"
export SEED="${SEED:-1337}"

# Model architecture
export NUM_LAYERS=11
export MODEL_DIM=512
export NUM_HEADS=8
export NUM_KV_HEADS=8
export VOCAB_SIZE=1024
export TRAIN_SEQ_LEN=2048
export EVAL_SEQ_LEN=2048
export BIGRAM_VOCAB_SIZE=6144
export BIGRAM_DIM=128
export XSA_LAST_N=11
export VE_ENABLED=1
export VE_DIM=128
export VE_LAYERS="9,10"

# Optimizer
export MUON_WD=0.04
export ADAM_WD=0.04
export MATRIX_LR=0.025
export SCALAR_LR=0.025
export TIED_EMBED_LR=0.035
export MUON_MOMENTUM=0.99
export MUON_MOMENTUM_WARMUP_START=0.92
export MUON_MOMENTUM_WARMUP_STEPS=1500
export WARMDOWN_ITERS=3500

# Training budget: 390s train + ~210s AR self-gen GPTQ = 600s total
export MAX_WALLCLOCK_SECONDS=390
export ITERATIONS=20000

# Eval
export EVAL_STRIDE=64
export SWA_ENABLED=1

# Pruning
export PRUNE_PCT=0.10

# TTT
export TTT_EPOCHS=3
export TTT_LR=0.0001
export TTT_FREEZE_BLOCKS=2
export TTT_CHUNK_TOKENS=131072
export TTT_OPTIMIZER=adamw

torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee "/workspace/run_${RUN_ID}.log"
```
---

**Record metadata (JSON)**
```json
{
  "author": "elninja",
  "github_id": "elninja",
  "name": "AR Self-Gen GPTQ + Int6 + XSA + TTT",
  "blurb": "11L model with AR self-generated GPTQ calibration (64 seqs x 2048 tokens, temp=0.8). Model generates its own calibration data after training -- no training or validation data accessed during quantization. Full Hessian GPTQ with Cholesky error compensation. BigramHash 6144x128, value embeddings, sliding window eval, score-first TTT.",
  "date": "2026-04-01",
  "track": "10min_16mb",
  "val_loss": null,
  "val_bpb": null,
  "bytes_total": null,
  "bytes_code": null,
  "gpu": "8xH100 SXM",
  "training_time_seconds": 600,
  "num_layers": 11,
  "model_dim": 512,
  "num_heads": 8,
  "num_kv_heads": 8,
  "vocab_size": 1024,
  "train_seq_len": 2048,
  "eval_seq_len": 2048,
  "bigram_vocab_size": 6144,
  "bigram_dim": 128,
  "quantization": "int6 GPTQ (AR self-gen calibration, 64 seqs x 2048, temp=0.8)",
  "compression": "zstd level 22",
  "calibration": "AR self-generated (no external data during quantization)",
  "technique_summary": "AR self-gen GPTQ calibration replaces training-data calibration for legal compliance"
}
```