I want your help competing in Parameter Golf. Our target is to get to 1.0 BPB. The submission must be fully legal under the rules at https://github.com/openai/parameter-golf

You are continuing from prior work. Use the existing project memory and codebase carefully.

The backward-looking TTT constraint is critical and non-negotiable: the model may never train on a token before that token has already been scored.
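A minimal sketch of the constraint, with hypothetical `score_fn`/`train_fn` callables standing in for the real model (the actual harness lives in `train_gpt.py`):

```python
import math

def score_then_train(chunks, score_fn, train_fn):
    """Backward-looking TTT: every chunk is scored by the current model
    BEFORE the model is allowed to adapt on it."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        nll, n = score_fn(chunk)   # score first; the model has never trained on this chunk
        total_nll += nll
        total_tokens += n
        train_fn(chunk)            # only now may the model train on it
    return total_nll / total_tokens / math.log(2)  # mean bits per token (nats -> bits)
```

Any ordering where `train_fn` touches a chunk before `score_fn` does would be an illegal submission.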

## Current Best

**1.0541 BPB** (3-seed mean), achieved with the 5-expert Hedge mixer + CROWN-Q + TTT.

| Seed | BPB | Artifact | Eval time |
|------|-----|----------|-----------|
| 1337 | 1.0473 | 15.89 MB | 336s |
| 42 | 1.0686 | 15.69 MB | 336s |
| 7 | 1.0465 | 15.66 MB | 336s |

Code: `submission-2026-03-25/train_gpt.py` (97KB)

## What We Tried and What Worked

### CROWN-Q Training Penalty (WORKED — 2026-03-25)
Added a quantization-aware penalty during warmdown: `crownq_lambda * mean(w² * δ² / 12)`, where δ = row_max / clip_range. Encourages weights to land on quantization-friendly values. Artifact ~200KB smaller. `CROWN_Q_LAMBDA=0.01`.
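A pure-Python sketch of the penalty as described; `clip_range=15` is an assumption here (the int5 symmetric maximum), and the real code applies this per weight matrix on GPU:

```python
def crownq_penalty(weight_rows, clip_range=15.0, lam=0.01):
    """CROWN-Q penalty per the formula above: lam * mean(w^2 * delta^2 / 12),
    where delta = row_max / clip_range is the row's quantization step size."""
    total, count = 0.0, 0
    for row in weight_rows:
        delta = max(abs(w) for w in row) / clip_range
        for w in row:
            total += w * w * delta * delta / 12.0
            count += 1
    return lam * total / count
```

The δ²/12 factor is the expected squared error of uniform quantization with step δ, so the penalty pushes rows toward smaller dynamic range (smaller δ) and smaller weights.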

### Eval stride 32 → 64 (WORKED — 2026-03-25)
Halved scoring overhead with no BPB loss. Freed ~100s of eval budget.
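Why doubling the stride halves overhead: in sliding-window scoring, each forward pass scores only the final `stride` tokens of its window, so the pass count is roughly `n_tokens / stride`. A sketch (the context length of 2048 is an assumption):

```python
def scoring_windows(n_tokens, context=2048, stride=64):
    """Plan sliding-window scoring passes. Each pass sees tokens
    [ctx_start, score_end) but only [score_start, score_end) contribute
    to the BPB sum, so scored spans tile the sequence exactly once."""
    windows = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - context)
        windows.append((ctx_start, score_start, score_end))
        score_start = score_end
    return windows
```

Larger strides trade a little context per scored token for fewer forward passes; the stride 32 → 64 move found that trade free in BPB terms.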

### TTT epochs 3 → 4 (WORKED — 2026-03-25)
Used the freed eval time for one extra TTT epoch per chunk. Combined with stride=64 and CROWN-Q, improved the 3-seed mean from 1.0745 → 1.0541.

### Mixer Optimization (WORKED — major win)
The 5-expert Hedge mixer originally took 1573s to eval (far over the 600s budget). Optimized to 336s (with stride=64):
- Cached `expert_nll` between `mix_and_score()` and `update_weights()`, eliminating a redundant `get_expert_log_probs()` call (biggest win)
- Shared `log_softmax` between the neural and entropy experts
- Replaced GPU-CPU sync conditionals (`if tensor.sum() > 0`) with a Python int check (`if self.total_tokens > 0`)
- In-place `scatter_add_` on flattened views instead of allocating 67M-element temporary tensors
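The Hedge update at the mixer's core is plain multiplicative weights; a sketch in log space (the actual expert set and batching live in the mixer class):

```python
import math

def hedge_update(log_weights, expert_nlls, eta=0.1):
    """Hedge / multiplicative weights: w_i <- w_i * exp(-eta * loss_i),
    then renormalize. Experts with lower NLL on recent tokens gain weight.
    Done in log space with log-sum-exp for numerical stability."""
    lws = [lw - eta * nll for lw, nll in zip(log_weights, expert_nlls)]
    m = max(lws)
    z = m + math.log(sum(math.exp(lw - m) for lw in lws))  # log normalizer
    return [lw - z for lw in lws]
```

With eta=0.1 the weights move slowly, which is why the noisier n-gram experts don't destabilize the mix (and why eta=0.15 overreacted, per the failure noted later).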

### Bigram table reduction (WORKED)
Reduced `bigram_vocab_size` from 8192 → 6144. Reliably saves ~310KB from the artifact. Surprisingly IMPROVED BPB (1.0973 → 1.0578 for seed 1337) — fewer parameters train better in the available steps.

### All-int5 quantization (WORKED)
Set `int6_last_n=0` (all layers use int5; previously the last 2 blocks used int6). Reliably saves ~300KB from the bitwidth reduction. Combined with bigram=6144, this gives ~500KB of margin under 16MB.
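For reference, symmetric int5 per-row quantization looks like the sketch below (codes clamped to [-15, 15]). This is only round-to-nearest on the int5 grid; the submission uses GPTQ, which picks roundings with second-order information rather than independently per weight:

```python
def quantize_row_int5(row):
    """Round-to-nearest symmetric int5: scale = row_max / 15,
    codes in [-15, 15], dequantized value = code * scale."""
    row_max = max(abs(w) for w in row)
    if row_max == 0.0:
        return [0] * len(row), [0.0] * len(row)
    scale = row_max / 15.0
    codes = [max(-15, min(15, round(w / scale))) for w in row]
    return codes, [c * scale for c in codes]
```

Dropping from int6 to int5 halves the number of representable levels per row, which is where the ~300KB saving comes from once the codes are entropy-coded by zstd.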

### Stripped dead code (WORKED — small)
Removed the unused PPMModel, FastPPMModel, and ExactMatchCache classes and the interpolate_with_ppm stub. Saved ~11KB of code size.

### GPTQ calibration under training budget (REQUIRED by rules)
Competition organizers confirmed GPTQ calibration counts as training time because it accesses training data, so it must fit within 600s. We reserve 18s from training (the loop stops at 582s) for EMA selection + calibration + quantization + compression. Early warmdown: the LR schedule targets 582s so warmdown completes before the loop stops.

### Skip diagnostic evals (WORKED)
Removed the post-training EMA/SWA diagnostic eval_val() calls; we just use EMA directly. Saves ~5s of the training reserve.

### Reduced GPTQ calibration samples (WORKED)
256 → 128 samples. Calibration time 3.8s → 1.9s. No measurable quality impact.

## What Didn't Work

### qTTT — Q-projection-only TTT (FAILED — 2026-03-25)
Unfroze only the Q projections during TTT (with 7 epochs, 6 blocks). Got 1.095 BPB vs the 1.056 baseline. Too little adaptation capacity — Q-only can't compensate for frozen K/V/MLP even with more epochs and blocks.

### 4-gram mixer expert (FAILED — 2026-03-25)
Added a 4th n-gram expert (K=5→6) using 65K hash buckets. Got 1.105 BPB vs the 1.056 baseline. Hash collisions plus sparse data produce noise that hurts mixer convergence.

### MIXER_ETA=0.15 (FAILED — 2026-03-25)
A higher Hedge learning rate caused overreaction to the noisy n-gram experts. 0.1 is the sweet spot.

### 8 TTT epochs (FAILED — 2026-03-25)
Overfitting: 1.074 BPB vs 1.047 with 4 epochs. Diminishing returns after 4 epochs at lr=0.0001.

### Increased pruning to compensate for fewer training steps (FAILED)
With 600s of training, 1% more pruning saves ~878KB. With 575s of training (25s reserve), 1% more pruning saves only ~9KB. Fewer training steps produce fundamentally higher-entropy weights that don't compress well regardless of pruning.

### bigram_vocab_size=4096 (WORSE)
Going smaller than 6144 was counterproductive: BPB went from 1.0578 → 1.0992 and the artifact was actually LARGER (GPTQ non-determinism). The sweet spot is 6144.

### LoRA TTT (LEGALITY QUESTION)
Achieved 1.0732 BPB (3-seed mean), but legality under the competition rules is uncertain. Per-document LoRA adaptation at eval time is powerful but may violate the spirit of the rules.

### Large training reserve (25s) (PROBLEMATIC)
Losing 250 training steps to post-loop overhead hurts model quality AND compression significantly. 18s reserve is the practical minimum (covers EMA + 2s calibration + quantization + save).

### GPTQ calibration on pre-EMA model (FAILED)
Moving calibration before EMA/SWA selection creates a Hessian mismatch — Hessians from the wrong model → suboptimal quantization → larger artifacts.

### Various architecture experiments (MIXED)
- 12L model: better BPB but always over 16MB
- MoE: OOM (multiplies params)
- Depth recurrence (5L×2 loops): much worse than 10L unique
- Focal loss: distorts the CE objective, worse BPB
- Curriculum learning (1024→2048 seq): 0.12 BPB of quantization damage from the sequence-length mismatch
- Hyper-connections: marginal signal (-0.003), not worth the complexity
- Entropy regularization: 214ms/step, too slow

## Competition Constraints

- Train <= 10 minutes (600s) on 8xH100 — includes GPTQ calibration
- Eval <= 10 minutes (600s) on 8xH100
- Artifact <= 16,000,000 bytes (16 MB, NOT MiB) total (code + compressed model)
- No training on validation data before scoring it
- No external downloads during eval
- GPTQ calibration counts as training time (accesses training data)
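The MB-vs-MiB distinction is worth a sanity check in any size-reporting code, since many tools label MiB as "MB". A 16 MiB artifact would be well over the limit:

```python
MB = 10**6          # competition limit unit (decimal megabyte)
MiB = 2**20         # what many tools report as "MB"
LIMIT = 16 * MB     # 16,000,000 bytes

# shipping a 16 "MiB" artifact would overshoot the limit by:
overshoot = 16 * MiB - LIMIT
```

That ~777KB gap is larger than the entire margin the bigram and int5 changes bought, so always check artifact size in raw bytes.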

## Key Files

Code:
- `submission-2026-03-25/train_gpt.py` — current best submission (97KB)
- `submission-2026-03-24/train_gpt.py` — previous submission (96KB, 1.0745 BPB)
- `submission_2026-03-23/train_gpt.py` — older submission code

Tracking:
- `experiments.csv` — ~125 experiments tracked
- `8xh100_AGENT_BRIEF.md` — competition context

## Current Architecture

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 8KV) |
| MLP | 3.5x with LeakyReLU(0.5)^2 |
| BigramHash | 6144 (dim=128) |
| XSA | All 11 layers (ws=8) |
| VE128 | Layers 9-10 |
| Quantization | Full GPTQ int5 + zstd level 22 |
| Pruning | 3% magnitude |
| CROWN-Q | lambda=0.01 during warmdown |
| TTT | AdamW lr=0.0001, 4 epochs, 131K chunks, Polyak 0.998 |
| Mixer | 5-expert Hedge (neural, unigram, bigram, trigram, entropy), eta=0.1 |
| Training reserve | 18s (for EMA + calibration + quantization) |
| Early warmdown | LR schedule targets 582s |
| Eval stride | 64 |
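The Polyak 0.998 entry refers to averaging during TTT: the scored model is an exponential moving average of the adapting weights, which damps per-chunk noise. A sketch over plain lists (the real code averages tensors in place):

```python
def polyak_update(avg_params, new_params, decay=0.998):
    """One EMA step: avg <- decay * avg + (1 - decay) * new.
    With decay=0.998 the average has an effective horizon of ~500 steps."""
    return [decay * a + (1.0 - decay) * p
            for a, p in zip(avg_params, new_params)]
```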

## Running Experiments

On gcp-eval-us (8xH100):
```bash
cd ~/parameter-golf-8xh100/submission-2026-03-25
DATA_PATH=../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 \
TTT_EPOCHS=4 TTT_FREEZE_BLOCKS=2 \
TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
ADAPTIVE_LR=1 ADAPTIVE_LR_MAX=3.0 \
EVAL_STRIDE=64 \
CROWN_Q_LAMBDA=0.01 \
~/.venv/bin/torchrun --standalone --nproc_per_node=8 train_gpt.py
```

IMPORTANT: Never run two training jobs simultaneously on the same GPUs — this causes a 2x slowdown and corrupts results.

## Eval Time Budget

Current eval takes ~336s of the 600s budget. Breakdown:
- Scoring (sliding window, stride=64): ~85s
- TTT training (4 epochs × 474 chunks): ~240s
- Mixer overhead: ~11s

**264s of eval budget remains unused.** This could fit ~2 more TTT epochs (total 6), but 8 was shown to overfit. The sweet spot appears to be 4-5 epochs at lr=0.0001.

## Working Style

- Run one experiment at a time
- Keep experiments.csv updated
- Preserve a clear record of hypotheses, changes, and outcomes
- Prefer high-upside ideas over incremental tuning
- Call out immediately if an idea seems illegal or unlikely to move the metric
- DO NOT open pull requests or push to any remote repository

## What to Focus On

We need to close the gap from 1.0541 → 1.0 BPB. Study other submissions for inspiration: https://github.com/openai/parameter-golf/pulls

Ideas worth exploring:
- **TTT tuning**: 5 epochs with lower LR (0.00008), different chunk sizes, different Polyak decay
- **Training improvements**: depth recurrence (PR #686), VRL across all layers, SWA/EMA 50/50 blend (PR #692)
- **Mixer improvements**: better smoothing for n-grams, adaptive eta decay, per-window mixing
- **Compression**: codebook quantization, Huffman encoding instead of zstd

Prioritize ideas that are both original and legally defensible. Avoid gray-area eval tricks.