Commit 30e7835 (parent 57d1d2c) by RoyiRaclaude: docs: update ramp_up_prompt.md for V27 (1.0541 BPB)

ramp_up_prompt.md
I want your help competing in Parameter Golf. Our target is to get to 1.0 BPB. The submission must be fully legal under the rules at https://github.com/openai/parameter-golf

You are continuing from prior work. Use the existing project memory and codebase carefully.

The backward-looking TTT constraint is critical and non-negotiable: the model may never train on a token before that token has already been scored.

## Current Best

**1.0541 BPB** (3-seed mean), achieved with the 5-expert Hedge mixer + CROWN-Q + TTT.
| Seed | BPB | Artifact | Eval time |
|------|-----|----------|-----------|
| 1337 | 1.0473 | 15.89 MB | 336s |
| 42 | 1.0686 | 15.69 MB | 336s |
| 7 | 1.0465 | 15.66 MB | 336s |

Code: `submission-2026-03-25/train_gpt.py` (97KB)
18+
19+
## What We Tried and What Worked
20+
21+
### CROWN-Q Training Penalty (WORKED — 2026-03-25)
22+
Added quantization-aware penalty during warmdown: `crownq_lambda * mean(w² * δ² / 12)` where δ = row_max / clip_range. Encourages weights to land on quantization-friendly values. Artifact ~200KB smaller. `CROWN_Q_LAMBDA=0.01`.
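A minimal sketch of how such a penalty could be computed, implementing the formula exactly as stated (the function name, signature, and per-row δ are assumptions; the real hook lives in `train_gpt.py`):

```python
import torch

def crownq_penalty(weight: torch.Tensor, clip_range: float,
                   crownq_lambda: float = 0.01) -> torch.Tensor:
    # Per-row quantization step: delta = row_max / clip_range (as in the formula above).
    row_max = weight.abs().amax(dim=1, keepdim=True)
    delta = row_max / clip_range
    # Penalty: lambda * mean(w^2 * delta^2 / 12); the /12 is the variance of
    # uniform rounding noise over one quantization step.
    return crownq_lambda * (weight.pow(2) * delta.pow(2) / 12.0).mean()
```

Added to the training loss only during warmdown, so early training is unaffected.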
### Eval stride 32 → 64 (WORKED — 2026-03-25)
Halved scoring overhead with no BPB loss. Freed ~100s of eval budget.
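For reference, strided sliding-window scoring can be sketched as follows (illustrative; `model`, shapes, and the exact bookkeeping in `train_gpt.py` are assumptions). Each window scores only its last `stride` targets, so doubling the stride halves the number of forward passes while every token is still scored once with long left context:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_nll(model, ids: torch.Tensor, ctx: int = 2048, stride: int = 64) -> float:
    # ids: [N] token ids. Every target token is scored exactly once; each
    # forward pass only scores its trailing `stride` positions.
    n = ids.size(0)
    total, count = 0.0, 0
    for begin in range(0, n - 1, stride):
        end = min(begin + stride, n - 1)       # last input position in this window
        start = max(0, end - ctx)              # left edge of the context window
        inp = ids[start:end]                   # inputs
        tgt = ids[start + 1 : end + 1]         # next-token targets, same length
        logits = model(inp.unsqueeze(0)).squeeze(0)   # [len(inp), V]
        new = end - begin                      # only the trailing targets are new
        total += F.cross_entropy(logits[-new:], tgt[-new:], reduction="sum").item()
        count += new
    return total / count                       # mean NLL in nats per token
```

Dividing the summed NLL by ln 2 times the byte count scored would give BPB.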
### TTT epochs 3 → 4 (WORKED — 2026-03-25)
Used the freed eval time for one extra TTT epoch per chunk. Combined with stride=64 and CROWN-Q, this improved the 3-seed mean from 1.0745 → 1.0541.
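The backward-looking constraint stated at the top fixes the shape of this loop: every chunk must be scored before the model adapts on it. A schematic, with `score` and `train_step` as stand-ins for the real functions:

```python
def ttt_eval(chunks, score, train_step, epochs: int = 4):
    # Backward-looking TTT: each chunk is scored with the CURRENT weights
    # before the model adapts on it, so no token is ever trained on before
    # it has been scored.
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))   # 1) score first
        for _ in range(epochs):       # 2) then adapt on the same chunk
            train_step(chunk)
    return losses
```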
### Mixer Optimization (WORKED — major win)
The 5-expert Hedge mixer originally took 1573s of eval (well over the 600s budget). Optimized to 336s (with stride=64):
- Cached `expert_nll` between `mix_and_score()` and `update_weights()`, eliminating a redundant `get_expert_log_probs()` call (biggest win)
- Shared `log_softmax` between the neural and entropy experts
- Replaced GPU-CPU sync conditionals (`if tensor.sum() > 0`) with a Python int check (`if self.total_tokens > 0`)
- Used in-place `scatter_add_` on flattened views instead of allocating 67M-element temporary tensors
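A minimal sketch of the Hedge mixing plus the cached-NLL trick (the method names match the calls above, but the shapes and internals here are assumptions, not the real implementation):

```python
import torch

class HedgeMixer:
    # Illustrative sketch; the real expert set, buffering, and sharding differ.
    def __init__(self, n_experts: int, eta: float = 0.1):
        self.log_w = torch.zeros(n_experts)  # uniform prior over experts
        self.eta = eta
        self._expert_nll = None              # cached between scoring and update

    def mix_and_score(self, expert_logp: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # expert_logp: [E, T, V] per-expert log-probs; target: [T] token ids.
        log_mix_w = torch.log_softmax(self.log_w, dim=0).view(-1, 1, 1)
        mixed_logp = torch.logsumexp(expert_logp + log_mix_w, dim=0)      # [T, V]
        # Cache each expert's per-token NLL so update_weights() needs no
        # second pass over expert log-probs (the "biggest win" above).
        idx = target.view(1, -1, 1).expand(expert_logp.size(0), -1, 1)
        self._expert_nll = -expert_logp.gather(2, idx).squeeze(2)         # [E, T]
        return -mixed_logp.gather(1, target.view(-1, 1)).squeeze(1)       # [T]

    def update_weights(self) -> None:
        # Hedge: multiply each expert's weight by exp(-eta * its summed loss).
        self.log_w = self.log_w - self.eta * self._expert_nll.sum(dim=1)
```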
### Bigram table reduction (WORKED)
Reduced `bigram_vocab_size` from 8192 → 6144. Reliably saves ~310KB of artifact. Surprisingly, it also IMPROVED BPB (1.0973 → 1.0578 for seed 1337): fewer parameters let the model train better in the available steps.
### All-int5 quantization (WORKED)
Set `int6_last_n=0` (the last 2 blocks drop from int6 to int5, so all layers use int5). Reliably saves ~300KB from the bitwidth reduction. Combined with bigram=6144, this gives ~500KB of margin under 16MB.
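For orientation, a symmetric per-row int5 quantizer looks like this (an illustrative round-to-nearest sketch only; the real pipeline uses GPTQ, which additionally corrects remaining weights using Hessian information from calibration data):

```python
import torch

def quantize_rowwise(w: torch.Tensor, bits: int = 5):
    # Symmetric per-row round-to-nearest quantization.
    qmax = 2 ** (bits - 1) - 1                                   # int5 -> [-15, 15]
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / qmax
    q = torch.round(w / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

The rounding error is bounded by half a quantization step per weight, which is what the CROWN-Q penalty above is shaping the weights against.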
### Stripped dead code (WORKED — small)
Removed the unused PPMModel, FastPPMModel, and ExactMatchCache classes and the interpolate_with_ppm stub. Saved ~11KB of code size.
### GPTQ calibration under the training budget (REQUIRED by rules)
Competition organizers confirmed that GPTQ calibration counts as training time because it accesses training data, so it must fit within the 600s. We reserve 18s of training time (the loop stops at 582s) for EMA selection + calibration + quantization + compression. Early warmdown: the LR schedule targets 582s so warmdown completes before the loop stops.
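The reserve can be enforced with a wall-clock deadline; a sketch (function names and the exact accounting are assumptions):

```python
import time

def train_with_reserve(step_fn, finalize_fn, budget_s: float = 600.0, reserve_s: float = 18.0):
    # Stop the optimizer loop `reserve_s` early so EMA selection, GPTQ
    # calibration, quantization, and compression all fit inside the budget.
    deadline = time.monotonic() + budget_s - reserve_s
    steps = 0
    while time.monotonic() < deadline:
        step_fn()
        steps += 1
    finalize_fn()  # EMA + calibration + quantization + save
    return steps
```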
### Skip diagnostic evals (WORKED)
Removed the post-training EMA/SWA diagnostic eval_val() calls; we just use EMA directly. Saves ~5s of the training reserve.

### Reduced GPTQ calibration samples (WORKED)
256 → 128 samples. Calibration time dropped from 3.8s to 1.9s with no measurable quality impact.
## What Didn't Work

### qTTT — Q-projection-only TTT (FAILED — 2026-03-25)
Unfroze only the Q projections during TTT (7 epochs, 6 blocks). Got 1.095 BPB vs the 1.056 baseline. Too little adaptation capacity: Q-only cannot compensate for frozen K/V/MLP even with more epochs and blocks.

### 4-gram mixer expert (FAILED — 2026-03-25)
Added a 4th n-gram expert (K=5→6) using 65K hash buckets. Got 1.105 BPB vs the 1.056 baseline. Hash collisions plus sparse data produce noise that hurts mixer convergence.

### MIXER_ETA=0.15 (FAILED — 2026-03-25)
The higher Hedge learning rate overreacted to the noisy n-gram experts. 0.1 is the sweet spot.

### 8 TTT epochs (FAILED — 2026-03-25)
Overfitting: 1.074 BPB vs 1.047 with 4 epochs. Diminishing returns after 4 epochs at lr=0.0001.
### Increased pruning to compensate for fewer training steps (FAILED)
With 600s of training, 1% more pruning saves ~878KB. With 575s of training (25s reserve), 1% more pruning saves only ~9KB. Fewer training steps produce fundamentally higher-entropy weights that do not compress well regardless of pruning.

### bigram_vocab_size=4096 (WORSE)
Going smaller than 6144 was counterproductive: BPB went from 1.0578 → 1.0992 and the artifact was actually LARGER (GPTQ non-determinism). The sweet spot is 6144.

### LoRA TTT (LEGALITY QUESTION)
Achieved 1.0732 BPB (3-seed mean), but legality under the competition rules is uncertain. Per-document LoRA adaptation at eval time is powerful but may violate the spirit of the rules.

### Large training reserve (25s) (PROBLEMATIC)
Losing 250 training steps to post-loop overhead hurts both model quality and compression significantly. 18s of reserve is the practical minimum (covers EMA + ~2s calibration + quantization + save).

### GPTQ calibration on the pre-EMA model (FAILED)
Moving calibration before EMA/SWA selection creates a Hessian mismatch: Hessians from the wrong model lead to suboptimal quantization and larger artifacts.
### Various architecture experiments (MIXED)
- 12L model: better BPB but always over 16MB
- MoE: OOM (multiplies params)
- Depth recurrence (5L×2 loops): much worse than 10L unique
- Focal loss: distorts the CE objective, worse BPB
- Curriculum learning (1024→2048 seq): 0.12 BPB of quantization damage from the seq-length mismatch
- Hyper-connections: marginal signal (-0.003), not worth the complexity
- Entropy regularization: 214ms/step, too slow
## Competition Constraints

- Train <= 10 minutes (600s) on 8xH100 — includes GPTQ calibration
- Eval <= 10 minutes (600s) on 8xH100
- Artifact <= 16,000,000 bytes (16 MB, NOT MiB) total (code + compressed model)
- No training on validation data before scoring it
- No external downloads during eval
- GPTQ calibration counts as training time (it accesses training data)
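Because the limit is decimal megabytes, a MiB-based check would overshoot by ~777KB. A trivial guard (the constant comes from the rules above; everything else is illustrative):

```python
ARTIFACT_LIMIT = 16_000_000  # 16 MB in decimal bytes, NOT 16 MiB (16,777,216)

def artifact_ok(num_bytes: int) -> bool:
    # True if code + compressed model together fit under the limit.
    return num_bytes <= ARTIFACT_LIMIT
```

Call it on the byte size of the final packed submission, e.g. via `os.path.getsize`.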
## Key Files

Code:
- `submission-2026-03-25/train_gpt.py` — current best submission (97KB)
- `submission-2026-03-24/train_gpt.py` — previous submission (96KB, 1.0745 BPB)
- `submission_2026-03-23/train_gpt.py` — older submission code

Tracking:
- `experiments.csv` — ~125 experiments tracked
- `8xh100_AGENT_BRIEF.md` — competition context
## Current Architecture

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 8KV) |
| MLP | 3.5x with LeakyReLU(0.5)^2 |
| BigramHash | 6144 (dim=128) |
| XSA | All 11 layers (ws=8) |
| VE128 | Layers 9-10 |
| Quantization | Full GPTQ int5 + zstd level 22 |
| Pruning | 3% magnitude |
| CROWN-Q | lambda=0.01 during warmdown |
| TTT | AdamW lr=0.0001, 4 epochs, 131K chunks, Polyak 0.998 |
| Mixer | 5-expert Hedge (neural, unigram, bigram, trigram, entropy), eta=0.1 |
| Training reserve | 18s (for EMA + calibration + quantization) |
| Early warmdown | LR schedule targets 582s |
| Eval stride | 64 |
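The "Polyak 0.998" entry refers to an exponential moving average of the TTT weights; a sketch of the update (the parameter lists and call site are assumptions):

```python
import torch

@torch.no_grad()
def polyak_update(avg_params, live_params, decay: float = 0.998):
    # avg <- decay * avg + (1 - decay) * live, applied in place after each
    # TTT step; the averaged weights are the ones used for scoring.
    for a, p in zip(avg_params, live_params):
        a.mul_(decay).add_(p, alpha=1.0 - decay)
```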
## Running Experiments

On gcp-eval-us (8xH100):

```bash
cd ~/parameter-golf-8xh100/submission-2026-03-25
DATA_PATH=../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 \
TTT_EPOCHS=4 TTT_FREEZE_BLOCKS=2 \
TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
ADAPTIVE_LR=1 ADAPTIVE_LR_MAX=3.0 \
EVAL_STRIDE=64 \
CROWN_Q_LAMBDA=0.01 \
~/.venv/bin/torchrun --standalone --nproc_per_node=8 train_gpt.py
```

IMPORTANT: Never run two training jobs simultaneously on the same GPUs — this causes a 2x slowdown and corrupts results.
## Eval Time Budget

Current eval takes ~336s of the 600s budget. Breakdown:
- Scoring (sliding window, stride=64): ~85s
- TTT training (4 epochs × 474 chunks): ~240s
- Mixer overhead: ~11s

**264s of eval budget remains unused.** This could fit ~2 more TTT epochs (total 6), but 8 was shown to overfit. The sweet spot appears to be 4-5 epochs at lr=0.0001.
## Working Style

- Run one experiment at a time
- Keep experiments.csv updated
- Preserve a clear record of hypotheses, changes, and outcomes
- Prefer high-upside ideas over incremental tuning
- Call out immediately if an idea seems illegal or unlikely to move the metric
- DO NOT open pull requests or push to any remote repository
## What to Focus On

We need to close the gap from 1.0541 → 1.0 BPB. Study other submissions for inspiration: https://github.com/openai/parameter-golf/pulls

Ideas worth exploring:
- **TTT tuning**: 5 epochs with lower LR (0.00008), different chunk sizes, different Polyak decay
- **Training improvements**: depth recurrence (PR #686), VRL across all layers, SWA/EMA 50/50 blend (PR #692)
- **Mixer improvements**: better smoothing for n-grams, adaptive eta decay, per-window mixing
- **Compression**: codebook quantization, Huffman encoding instead of zstd

Prioritize ideas that are both original and legally defensible. Avoid gray-area eval tricks.
