I want your help competing in Parameter Golf. Our target is to get to 1.0 BPB. The submission must be fully legal under the rules at https://github.com/openai/parameter-golf

You are continuing from prior work. Use the existing project memory and codebase carefully.

The backward-looking TTT constraint is critical and non-negotiable: the model may never train on a token before that token has already been scored.
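A minimal sketch of the constraint, with hypothetical `score_fn`/`train_fn` callables standing in for the real model (the actual harness lives in `train_gpt.py`):

```python
import math

def score_then_train(chunks, score_fn, train_fn):
    """Backward-looking TTT: every chunk is scored by the current model
    BEFORE the model is allowed to adapt on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        nll, n = score_fn(chunk)   # score first; the model has never trained on this chunk
        total_nll += nll
        total_tokens += n
        train_fn(chunk)            # only now may the model train on it
    return total_nll / total_tokens / math.log(2)  # mean bits per token (nats -> bits)
```

Any ordering where `train_fn` touches a chunk before `score_fn` does would be an illegal submission.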

## Current Best

**1.0541 BPB** (3-seed mean), achieved with the 5-expert Hedge mixer + CROWN-Q + TTT.

| Seed | BPB | Artifact | Eval time |
|------|-----|----------|-----------|
| 1337 | 1.0473 | 15.89 MB | 336s |
| 42 | 1.0686 | 15.69 MB | 336s |
| 7 | 1.0465 | 15.66 MB | 336s |

Code: `submission-2026-03-25/train_gpt.py` (97KB)

## What We Tried and What Worked

### CROWN-Q Training Penalty (WORKED — 2026-03-25)
Added a quantization-aware penalty during warmdown: `crownq_lambda * mean(w² * δ² / 12)`, where δ = row_max / clip_range. Encourages weights to land on quantization-friendly values. Artifact ~200KB smaller. `CROWN_Q_LAMBDA=0.01`.
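A pure-Python sketch of the penalty as described; `clip_range=15` is an assumption here (the int5 symmetric maximum), and the real code applies this per weight matrix on GPU:

```python
def crownq_penalty(weight_rows, clip_range=15.0, lam=0.01):
    """CROWN-Q penalty per the formula above: lam * mean(w^2 * delta^2 / 12),
    where delta = row_max / clip_range is the row's quantization step size."""
    total, count = 0.0, 0
    for row in weight_rows:
        delta = max(abs(w) for w in row) / clip_range
        for w in row:
            total += w * w * delta * delta / 12.0
            count += 1
    return lam * total / count
```

The δ²/12 factor is the expected squared error of uniform quantization with step δ, so the penalty pushes rows toward smaller dynamic range (smaller δ) and smaller weights.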

### Eval stride 32 → 64 (WORKED — 2026-03-25)
Halved scoring overhead with no BPB loss. Freed ~100s of eval budget.
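Why doubling the stride halves overhead: in sliding-window scoring, each forward pass scores only the final `stride` tokens of its window, so the pass count is roughly `n_tokens / stride`. A sketch (the context length of 2048 is an assumption):

```python
def scoring_windows(n_tokens, context=2048, stride=64):
    """Plan sliding-window scoring passes. Each pass sees tokens
    [ctx_start, score_end) but only [score_start, score_end) contribute
    to the BPB sum, so scored spans tile the sequence exactly once."""
    windows = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - context)
        windows.append((ctx_start, score_start, score_end))
        score_start = score_end
    return windows
```

Larger strides trade a little context per scored token for fewer forward passes; the stride 32 → 64 move found that trade free in BPB terms.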

### TTT epochs 3 → 4 (WORKED — 2026-03-25)
Used the freed eval time for one extra TTT epoch per chunk. Combined with stride=64 and CROWN-Q, improved the 3-seed mean from 1.0745 → 1.0541.

### Mixer Optimization (WORKED — major win)
The 5-expert Hedge mixer originally took 1573s to eval (far over the 600s budget). Optimized to 336s (with stride=64):
- Cached `expert_nll` between `mix_and_score()` and `update_weights()`, eliminating a redundant `get_expert_log_probs()` call (biggest win)
- Shared `log_softmax` between the neural and entropy experts
- Replaced GPU-CPU sync conditionals (`if tensor.sum() > 0`) with a Python int check (`if self.total_tokens > 0`)
- In-place `scatter_add_` on flattened views instead of allocating 67M-element temporary tensors
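The Hedge update at the mixer's core is plain multiplicative weights; a sketch in log space (the actual expert set and batching live in the mixer class):

```python
import math

def hedge_update(log_weights, expert_nlls, eta=0.1):
    """Hedge / multiplicative weights: w_i <- w_i * exp(-eta * loss_i),
    then renormalize. Experts with lower NLL on recent tokens gain weight.
    Done in log space with log-sum-exp for numerical stability."""
    lws = [lw - eta * nll for lw, nll in zip(log_weights, expert_nlls)]
    m = max(lws)
    z = m + math.log(sum(math.exp(lw - m) for lw in lws))  # log normalizer
    return [lw - z for lw in lws]
```

With eta=0.1 the weights move slowly, which is why the noisier n-gram experts don't destabilize the mix (and why eta=0.15 overreacted, per the failure noted later).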

### Bigram table reduction (WORKED)
Reduced `bigram_vocab_size` from 8192 → 6144. Reliably saves ~310KB from the artifact. Surprisingly IMPROVED BPB (1.0973 → 1.0578 for seed 1337) — fewer parameters train better in the available steps.

### All-int5 quantization (WORKED)
Set `int6_last_n=0` (all layers use int5; previously the last 2 blocks used int6). Reliably saves ~300KB from the bitwidth reduction. Combined with bigram=6144, this gives ~500KB of margin under 16MB.
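For reference, symmetric int5 per-row quantization looks like the sketch below (codes clamped to [-15, 15]). This is only round-to-nearest on the int5 grid; the submission uses GPTQ, which picks roundings with second-order information rather than independently per weight:

```python
def quantize_row_int5(row):
    """Round-to-nearest symmetric int5: scale = row_max / 15,
    codes in [-15, 15], dequantized value = code * scale."""
    row_max = max(abs(w) for w in row)
    if row_max == 0.0:
        return [0] * len(row), [0.0] * len(row)
    scale = row_max / 15.0
    codes = [max(-15, min(15, round(w / scale))) for w in row]
    return codes, [c * scale for c in codes]
```

Dropping from int6 to int5 halves the number of representable levels per row, which is where the ~300KB saving comes from once the codes are entropy-coded by zstd.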

### Stripped dead code (WORKED — small)
Removed the unused PPMModel, FastPPMModel, and ExactMatchCache classes and the interpolate_with_ppm stub. Saved ~11KB of code size.

### GPTQ calibration under training budget (REQUIRED by rules)
Competition organizers confirmed GPTQ calibration counts as training time because it accesses training data, so it must fit within 600s. We reserve 18s from training (the loop stops at 582s) for EMA selection + calibration + quantization + compression. Early warmdown: the LR schedule targets 582s so warmdown completes before the loop stops.

### Skip diagnostic evals (WORKED)
Removed the post-training EMA/SWA diagnostic eval_val() calls; we just use EMA directly. Saves ~5s of the training reserve.

### Reduced GPTQ calibration samples (WORKED)
256 → 128 samples. Calibration time 3.8s → 1.9s. No measurable quality impact.

## What Didn't Work

### qTTT — Q-projection-only TTT (FAILED — 2026-03-25)
Unfroze only the Q projections during TTT (with 7 epochs, 6 blocks). Got 1.095 BPB vs the 1.056 baseline. Too little adaptation capacity — Q-only can't compensate for frozen K/V/MLP even with more epochs and blocks.

### 4-gram mixer expert (FAILED — 2026-03-25)
Added a 4th n-gram expert (K=5→6) using 65K hash buckets. Got 1.105 BPB vs the 1.056 baseline. Hash collisions plus sparse data produce noise that hurts mixer convergence.

### MIXER_ETA=0.15 (FAILED — 2026-03-25)
A higher Hedge learning rate caused overreaction to the noisy n-gram experts. 0.1 is the sweet spot.

### 8 TTT epochs (FAILED — 2026-03-25)
Overfitting: 1.074 BPB vs 1.047 with 4 epochs. Diminishing returns after 4 epochs at lr=0.0001.

### Increased pruning to compensate for fewer training steps (FAILED)
With 600s of training, 1% more pruning saves ~878KB. With 575s of training (25s reserve), 1% more pruning saves only ~9KB. Fewer training steps produce fundamentally higher-entropy weights that don't compress well regardless of pruning.

### bigram_vocab_size=4096 (WORSE)
Going smaller than 6144 was counterproductive: BPB went from 1.0578 → 1.0992 and the artifact was actually LARGER (GPTQ non-determinism). The sweet spot is 6144.

### LoRA TTT (LEGALITY QUESTION)
Achieved 1.0732 BPB (3-seed mean), but legality under the competition rules is uncertain. Per-document LoRA adaptation at eval time is powerful but may violate the spirit of the rules.

### Large training reserve (25s) (PROBLEMATIC)
Losing 250 training steps to post-loop overhead hurts model quality AND compression significantly. 18s reserve is the practical minimum (covers EMA + 2s calibration + quantization + save).

### GPTQ calibration on pre-EMA model (FAILED)
Moving calibration before EMA/SWA selection creates a Hessian mismatch — Hessians from the wrong model → suboptimal quantization → larger artifacts.

### Various architecture experiments (MIXED)
- 12L model: better BPB but always over 16MB
- MoE: OOM (multiplies params)
- Depth recurrence (5L×2 loops): much worse than 10L unique
- Focal loss: distorts the CE objective, worse BPB
- Curriculum learning (1024→2048 seq): 0.12 BPB of quantization damage from the sequence-length mismatch
- Hyper-connections: marginal signal (-0.003), not worth the complexity
- Entropy regularization: 214ms/step, too slow

## Competition Constraints

- Train <= 10 minutes (600s) on 8xH100 — includes GPTQ calibration
- Eval <= 10 minutes (600s) on 8xH100
- Artifact <= 16,000,000 bytes (16 MB, NOT MiB) total (code + compressed model)
- No training on validation data before scoring it
- No external downloads during eval
- GPTQ calibration counts as training time (accesses training data)
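The MB-vs-MiB distinction is worth a sanity check in any size-reporting code, since many tools label MiB as "MB". A 16 MiB artifact would be well over the limit:

```python
MB = 10**6          # competition limit unit (decimal megabyte)
MiB = 2**20         # what many tools report as "MB"
LIMIT = 16 * MB     # 16,000,000 bytes

# shipping a 16 "MiB" artifact would overshoot the limit by:
overshoot = 16 * MiB - LIMIT
```

That ~777KB gap is larger than the entire margin the bigram and int5 changes bought, so always check artifact size in raw bytes.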

## Key Files

Code:
- `submission-2026-03-25/train_gpt.py` — current best submission (97KB)
- `submission-2026-03-24/train_gpt.py` — previous submission (96KB, 1.0745 BPB)
- `submission_2026-03-23/train_gpt.py` — older submission code

Tracking:
- `experiments.csv` — ~125 experiments tracked
- `8xh100_AGENT_BRIEF.md` — competition context

## Current Architecture

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 8KV) |
| MLP | 3.5x with LeakyReLU(0.5)^2 |
| BigramHash | 6144 (dim=128) |
| XSA | All 11 layers (ws=8) |
| VE128 | Layers 9-10 |
| Quantization | Full GPTQ int5 + zstd level 22 |
| Pruning | 3% magnitude |
| CROWN-Q | lambda=0.01 during warmdown |
| TTT | AdamW lr=0.0001, 4 epochs, 131K chunks, Polyak 0.998 |
| Mixer | 5-expert Hedge (neural, unigram, bigram, trigram, entropy), eta=0.1 |
| Training reserve | 18s (for EMA + calibration + quantization) |
| Early warmdown | LR schedule targets 582s |
| Eval stride | 64 |
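The Polyak 0.998 entry refers to averaging during TTT: the scored model is an exponential moving average of the adapting weights, which damps per-chunk noise. A sketch over plain lists (the real code averages tensors in place):

```python
def polyak_update(avg_params, new_params, decay=0.998):
    """One EMA step: avg <- decay * avg + (1 - decay) * new.
    With decay=0.998 the average has an effective horizon of ~500 steps."""
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]
```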

## Running Experiments

On gcp-eval-us (8xH100):
```bash
cd ~/parameter-golf-8xh100/submission-2026-03-25
DATA_PATH=../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 \
TTT_EPOCHS=4 TTT_FREEZE_BLOCKS=2 \
TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
ADAPTIVE_LR=1 ADAPTIVE_LR_MAX=3.0 \
EVAL_STRIDE=64 \
CROWN_Q_LAMBDA=0.01 \
~/.venv/bin/torchrun --standalone --nproc_per_node=8 train_gpt.py
```

IMPORTANT: Never run two training jobs simultaneously on the same GPUs — this causes a 2x slowdown and corrupts results.

## Eval Time Budget

Current eval takes ~336s of the 600s budget. Breakdown:
- Scoring (sliding window, stride=64): ~85s
- TTT training (4 epochs × 474 chunks): ~240s
- Mixer overhead: ~11s

**264s of eval budget remains unused.** This could fit ~2 more TTT epochs (total 6), but 8 was shown to overfit. The sweet spot appears to be 4-5 epochs at lr=0.0001.

## Working Style

- Run one experiment at a time
- Keep experiments.csv updated
- Preserve a clear record of hypotheses, changes, and outcomes
- Prefer high-upside ideas over incremental tuning
- Call out immediately if an idea seems illegal or unlikely to move the metric
- DO NOT open pull requests or push to any remote repository

## What to Focus On

We need to close the gap from 1.0541 → 1.0 BPB. Study other submissions for inspiration: https://github.com/openai/parameter-golf/pulls

Ideas worth exploring:
- **TTT tuning**: 5 epochs with lower LR (0.00008), different chunk sizes, different Polyak decay
- **Training improvements**: depth recurrence (PR #686), VRL across all layers, SWA/EMA 50/50 blend (PR #692)
- **Mixer improvements**: better smoothing for n-grams, adaptive eta decay, per-window mixing
- **Compression**: codebook quantization, Huffman encoding instead of zstd

Prioritize ideas that are both original and legally defensible. Avoid gray-area eval tricks.