
Commit 76cab36

eamon831 and claude committed
Full session context dump — research findings, status, next steps
CLAUDE.md: Complete project state for cross-session continuity
- Leaderboard intel (verified SOTA + unverified PRs openai#1006, openai#999, openai#831)
- 8192 vocab analysis (doesn't fit — only 9,994 bytes headroom)
- Three planned improvements with code status
- Environment setup instructions (Mac MLX + RunPod H100)
- Codebase layout and git remotes

experiments.md: 4 planned experiments with commands + success criteria

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent f65800c commit 76cab36


2 files changed (+187 −22 lines)


CLAUDE.md

Lines changed: 121 additions & 18 deletions
@@ -168,29 +168,132 @@ Beat the naive baseline (1.2244 BPB). Stretch goal: crack the top 10 (< 1.1570 B
 
 ## Current Leaderboard Context (2026-03-28)
 
-- SOTA: 1.1194 BPB (LeakyReLU² + TTT + Parallel Muon)
-- Baseline: 1.2244 BPB
-- Gap: 0.105 BPB across ~20 submissions in 10 days
+**Verified SOTA (merged):** 1.1194 BPB — PR #549 (LeakyReLU² + Legal TTT + Parallel Muon) by abaybektursun
+**Baseline:** 1.2244 BPB
+
+**Unverified PRs (open, not yet merged — check status next session):**
+- PR #1006: claims 1.1085 BPB — JEPA + AdamW TTT + Full Hessian GPTQ (1 seed only, needs 3)
+- PR #999: claims 1.1179 BPB — Entropy-Adaptive TTT epochs (3 seeds, looks solid)
+- PR #831: research paper — all 6 novel architectures tested FAILED at 16MB
+
+**Competition deadline:** April 30, 2026
 
 ## Strategy
 
-Based on research (rules above), our highest-ROI path:
+Based on deep research of all 25+ submissions:
+
+1. **Start from verified SOTA code** (PR #549) — copied to `our_train_gpt.py`
+2. **Add JEPA** (from PR #1006) — auxiliary loss, small parameter cost, DONE in code
+3. **Add Full Hessian GPTQ** (from PR #1006) — replaces GPTQ-lite, 13 seconds
+4. **Add AdamW TTT pre-quantization** (from PR #1006) — SGD fails on CastedLinear
+5. **Ablate each change** one at a time on H100
+
+### What We Investigated and Rejected
+
+**8192 vocab (Rule 3):** Biggest single lever (-0.42 BPB, proven by the ternary author), BUT:
+- SOTA artifact is 15,990,006 bytes — only 9,994 bytes of headroom
+- 8192×512 embedding needs +2.5MB compressed — doesn't fit
+- Factored 8192×128 + 128×512 still needs +440KB — doesn't fit
+- Factored 8192×64 + 64×512 needs +25KB — barely doesn't fit
+- Only works with ternary quantization (4x more compact) — completely different codebase
+- **VERDICT: not viable on the int6/GPTQ path. Would require rebuilding from the ternary base.**
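The headroom arithmetic above can be sanity-checked in a few lines. This is a rough sketch: it assumes int6 storage (0.75 bytes/param) and a current 1024×512 embedding, and it ignores container overhead and compression, so treat the deltas as estimates rather than exact artifact sizes.

```python
# Rough size check for the vocab options above.
# Assumptions (not measured from the artifact): weights stored as int6
# (0.75 bytes/param) and the current embedding is 1024x512.
INT6_BYTES_PER_PARAM = 6 / 8
HEADROOM_BYTES = 16_000_000 - 15_990_006  # 9,994 bytes

def emb_bytes(*shapes):
    """Total int6 bytes for a (possibly factored) embedding."""
    return sum(rows * cols for rows, cols in shapes) * INT6_BYTES_PER_PARAM

current = emb_bytes((1024, 512))
options = {
    "full 8192x512": emb_bytes((8192, 512)),
    "factored 8192x128 + 128x512": emb_bytes((8192, 128), (128, 512)),
    "factored 8192x64 + 64x512": emb_bytes((8192, 64), (64, 512)),
}
for name, size in options.items():
    delta = size - current
    print(f"{name}: +{delta / 1024:.0f} KiB, fits={delta <= HEADROOM_BYTES}")
```

Even the most aggressive factorization overshoots the ~10KB of headroom, consistent with the verdict above.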
+
+### Three Concrete Improvements to Test (from PR #1006)
+
+**1. JEPA (Joint-Embedding Predictive Architecture)**
+- Predicts future hidden states across multiple horizons (1, 2, 4, 8 steps)
+- Uses encoder output → context_encoder → predictor → compare with target_encoder(decoder output)
+- Target encoder updated via EMA (decay=0.996), no gradients
+- VICReg-style variance/covariance regularization prevents collapse
+- Loss weight: 0.12, latent dim: 256
+- Extra params: ~130K (3 LatentProjector modules + span embeddings)
+- **STATUS: code added to our_train_gpt.py, toggled via JEPA_ENABLED=1**
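For next session, the mechanism described above can be sketched as follows. This is a minimal illustration of multi-horizon latent prediction with an EMA target encoder and a VICReg-style variance floor; the module names, the `h_early`/`h_late` hand-off, and the exact loss wiring are our assumptions, not PR #1006's actual code (the covariance term is omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentProjector(nn.Module):
    """Two-layer MLP into the shared JEPA latent space."""
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_latent), nn.GELU(), nn.Linear(d_latent, d_latent))

    def forward(self, x):
        return self.net(x)

class JEPAAux(nn.Module):
    """Auxiliary JEPA loss: predict future latents at several horizons."""
    def __init__(self, d_model, d_latent=256, spans=(1, 2, 4, 8),
                 ema_decay=0.996, weight=0.12):
        super().__init__()
        self.spans, self.ema_decay, self.weight = spans, ema_decay, weight
        self.context = LatentProjector(d_model, d_latent)
        self.predictor = LatentProjector(d_latent, d_latent)
        self.span_emb = nn.Embedding(len(spans), d_latent)
        # Target encoder: EMA copy of the context encoder, never sees gradients.
        self.target = LatentProjector(d_model, d_latent)
        self.target.load_state_dict(self.context.state_dict())
        for p in self.target.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update_target(self):
        # EMA update (decay=0.996), called once per optimizer step.
        for pt, pc in zip(self.target.parameters(), self.context.parameters()):
            pt.mul_(self.ema_decay).add_(pc, alpha=1.0 - self.ema_decay)

    def forward(self, h_early, h_late):
        # h_early/h_late: (B, T, d_model) hidden states from two depths.
        loss = h_early.new_zeros(())
        for i, k in enumerate(self.spans):
            ctx = self.context(h_early[:, :-k]) + self.span_emb.weight[i]
            pred = self.predictor(ctx)
            with torch.no_grad():
                tgt = self.target(h_late[:, k:])
            loss = loss + F.mse_loss(pred, tgt)
            # VICReg-style variance floor to prevent latent collapse.
            std = pred.reshape(-1, pred.shape[-1]).std(dim=0)
            loss = loss + F.relu(1.0 - std).mean()
        return self.weight * loss / len(self.spans)
```

In training, `update_target()` runs once per optimizer step and the returned loss is added to the LM loss; at eval time the whole module is dropped, which is why JEPA carries no eval-time cost.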
+
+**2. Full Hessian GPTQ (replaces GPTQ-lite)**
+- Collects H = XᵀX via forward hooks during 128-batch calibration
+- Per-column rounding error compensated using the inverse Hessian
+- Block size 128, percdamp 0.01
+- Takes 13 seconds — fits in the eval budget
+- **STATUS: not yet implemented in our code. Reference: PR #1006 train_gpt.py**
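The core of the algorithm can be sketched compactly. This condenses Frantar et al.'s GPTQ to its per-column error-compensation loop: it omits the block-size-128 batching, and the symmetric int6 rounding and the hook below are our placeholders, not the PR #1006 implementation.

```python
import torch

def int6_quant(x, scale):
    # Placeholder symmetric int6 grid: levels in [-31, 31] (assumption,
    # not necessarily the SOTA quantization scheme).
    return (x / scale).round().clamp(-31, 31) * scale

def gptq_quantize(W, H, percdamp=0.01):
    """Quantize W (out_features x in_features) column by column, pushing
    each column's rounding error onto the not-yet-quantized columns via
    the inverse Hessian (GPTQ, Frantar et al.)."""
    W, H = W.clone(), H.clone()
    H.diagonal().add_(percdamp * H.diagonal().mean())  # dampen for stability
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)  # per-column update coeffs
    scale = W.abs().max() / 31
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        q = int6_quant(W[:, j], scale)
        Q[:, j] = q
        err = (W[:, j] - q) / U[j, j]
        # Compensate the error on all later columns.
        W[:, j + 1:] -= err.unsqueeze(1) * U[j, j + 1:].unsqueeze(0)
    return Q

def make_hessian_hook(state):
    """Forward hook accumulating H = X^T X from a layer's inputs."""
    def hook(module, inputs, output):
        x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()
        state["H"] = state.get("H", 0) + x.T @ x
    return hook
```

During the 128 calibration batches the hook would be registered on each Linear; afterwards each layer's weight is quantized with its accumulated `state["H"]`.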
+
+**3. AdamW TTT Pre-Quantization**
+- Key finding: SGD-based TTT fails on CastedLinear architectures
+- Fix: AdamW with cosine decay on the EMA-averaged model BEFORE quantization
+- GPTQ then quantizes the adapted weights
+- **STATUS: not yet implemented. Reference: PR #1006 train_gpt.py**
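The intended ordering can be sketched as a small driver: adapt the full-precision (EMA-averaged) model with AdamW + cosine decay, then quantize. The function shape, names, and the `quantize` callback are illustrative assumptions; PR #1006's actual TTT loop differs.

```python
import itertools
import torch
import torch.nn.functional as F

def adamw_ttt_then_quantize(model, calib_batches, steps=20, lr=1e-4,
                            quantize=None):
    """Test-time training on the full-precision model BEFORE quantization.
    Quantization then runs once, on the adapted weights, instead of SGD
    fighting already-quantized weights afterwards."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    model.train()
    for x, y in itertools.islice(itertools.cycle(calib_batches), steps):
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        sched.step()
    if quantize is not None:
        quantize(model)  # e.g. full-Hessian GPTQ over the adapted weights
    return model
```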
+
+### Key Research Findings (PR #831)
+
+**The throughput tax formula:** any technique must improve BPB by 0.007 per millisecond of step-time overhead. At the 83ms/step baseline, each extra 1ms costs ~7 steps, and each step ≈ 0.001 BPB.
+
+**All 6 novel architectures failed:**
+| Technique | ms/step impact | BPB | Why it failed |
+|-----------|----------------|-----|---------------|
+| MUD Optimizer | +5% | 1.1581 | solve_triangular can't use tensor cores |
+| Info-Max (XSA-all) | +7% | 1.1261 | overhead eats its own gain |
+| Hourglass FFN | +11% | 1.4519 | split weights catastrophic for int6 |
+| nGPT Hypersphere | +47% | 1.6915 | unit-norm incompatible with int6 |
+| TrigramHash | +18% | 1.1298 | hash overhead > trigram benefit |
+| SSM Hybrid | +240% | 1.2516 | breaks torch.compile |
+
+**Takeaway:** the SOTA stack is co-optimized (Parallel Muon + torch.compile + int6 + tensor cores). Breaking any pillar cascades. JEPA is one of the few additions that doesn't break this co-optimization because it's only active during training (no eval cost).
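The break-even rule is worth keeping as a helper next to the experiment log. The two constants come straight from the throughput-tax formula above (they are the notes' estimates, not fresh measurements):

```python
# Break-even BPB gain needed to justify extra per-step latency,
# using the notes' estimates: ~7 training steps lost per +1 ms/step,
# each step worth ~0.001 BPB.
STEPS_LOST_PER_MS = 7
BPB_PER_STEP = 0.001
BASELINE_MS_PER_STEP = 83

def required_bpb_gain(overhead_ms):
    return overhead_ms * STEPS_LOST_PER_MS * BPB_PER_STEP

# Example: a +5% step-time hit at the 83ms baseline is +4.15 ms.
print(round(required_bpb_gain(0.05 * BASELINE_MS_PER_STEP), 3))
```

A +5% overhead must buy roughly 0.029 BPB just to break even, which is why even the mildest entries in the table above ended up net negative.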
+
+## Codebase Layout
 
-1. **Start from SOTA code** (rule 4) — don't rebuild
-2. **Explore larger vocab** (rule 3) — 8192 BPE is the biggest untapped lever in the SOTA path
-3. **Factored embeddings** to fit larger vocab in 16MB budget
-4. **Keep everything that works** — LeakyReLU², TTT, Parallel Muon, EMA+SWA, XSA, BigramHash, GPTQ-lite
-5. **Ablate each change** (rule 6) — measure before and after
-6. **Never add ms/step without measuring** (rule 1)
+```
+parameter-golf/
+├── CLAUDE.md                        ← this file (project rules + context)
+├── experiments.md                   ← experiment log (track all runs)
+├── our_train_gpt.py                 ← OUR working code (SOTA + JEPA added)
+├── train_gpt.py                     ← baseline from OpenAI (9L, 1.2244 BPB)
+├── train_gpt_mlx.py                 ← MLX version for Mac testing
+├── data/
+│   ├── datasets/fineweb10B_sp1024/  ← downloaded (1 shard + val)
+│   └── tokenizers/                  ← BPE tokenizer (1024 vocab)
+├── records/
+│   ├── track_10min_16mb/            ← all leaderboard submissions
+│   └── track_non_record_16mb/       ← unlimited compute / research
+└── .venv/                           ← Python venv (MLX + deps)
+```
 
-## Workflow
+## Environment Setup
 
-- Develop locally, don't push until ready
-- Test on Mac (MLX) for quick iteration on architecture changes
-- Run real benchmarks on RunPod (H100) for timing + BPB
-- Submit PR to openai/parameter-golf when competitive
+**Local (Mac M1):**
+```bash
+cd ~/office_projects/parameter-golf
+source .venv/bin/activate
+# Smoke test:
+RUN_ID=mlx_smoke ITERATIONS=50 TRAIN_BATCH_TOKENS=8192 VAL_LOSS_EVERY=0 VAL_BATCH_SIZE=8192 python3 train_gpt_mlx.py
+```
+MLX venv has: mlx, mlx-lm, numpy, sentencepiece, huggingface-hub, datasets, tqdm
 
-## Remotes
+**RunPod (H100) — when credits arrive:**
+```bash
+# Use the official template: https://console.runpod.io/deploy?template=y5cejece4j&ref=nl2r56th
+cd /workspace && git clone https://github.com/eamon831/parameter-golf.git && cd parameter-golf
+python3 data/cached_challenge_fineweb.py --variant sp1024
+# Run SOTA baseline (Experiment 0):
+SEED=1337 RUN_ID=exp0_reproduce torchrun --standalone --nproc_per_node=8 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/train_gpt.py
+# Run our code with JEPA (Experiment 1):
+SEED=1337 JEPA_ENABLED=1 RUN_ID=exp1_jepa torchrun --standalone --nproc_per_node=8 our_train_gpt.py
+```
+
+## Blockers
+
+- **RunPod credits:** applied 2026-03-28, approval in 1-2 business days. Quick-start tier (~$25 / ~8 compute hours). Can also use Saiful's existing RunPod account.
+- **MLX validation is slow:** ~20 min for a full val pass on the Mac. Training is fine for directional testing.
+- **our_train_gpt.py requires CUDA + flash_attn_interface:** cannot test locally; the MLX script is separate.
+
+## Git Remotes
+
+- `origin` → `eamon831/parameter-golf` (our fork — push here)
+- `upstream` → `openai/parameter-golf` (submit PRs here)
+- GitHub auth: `eamon831` account via keyring
+
+## Workflow
 
-- `origin` → eamon831/parameter-golf (fork)
-- `upstream` → openai/parameter-golf (submit PRs here)
+1. `git fetch upstream` before every session — check new submissions
+2. Develop in `our_train_gpt.py` — test on RunPod
+3. Log every experiment in `experiments.md`
+4. When competitive: create a submission folder in `records/`, PR to upstream
+5. Don't push to upstream until we have 3-seed results with p < 0.01
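The p < 0.01 gate in step 5 can be checked with the stdlib alone. A sketch using a one-sided paired t-test (same seeds for both runs, n = 3, so df = 2; the 6.965 cutoff is the standard one-tailed p = 0.01 critical value for df = 2):

```python
import math
from statistics import mean, stdev

# One-tailed critical value for p = 0.01 with df = 2 (i.e. 3 paired seeds).
T_CRIT_DF2_P01 = 6.965

def significant_improvement(ours_bpb, baseline_bpb):
    """Paired one-sided t-test: is our per-seed BPB lower than baseline's
    at p < 0.01? Assumes the same seeds were used for both runs."""
    diffs = [b - o for o, b in zip(ours_bpb, baseline_bpb)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t > T_CRIT_DF2_P01
```

A consistent per-seed win of similar magnitude passes the gate; a noisy mixed result does not.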

experiments.md

Lines changed: 66 additions & 4 deletions
@@ -5,10 +5,72 @@ All experiments tracked here. Never re-run a failed experiment without a new hyp
 ---
 
 ## Experiment 0: Reproduce SOTA baseline
-- Date: pending (waiting for RunPod credits)
-- Hypothesis: SOTA code runs as documented and produces ~1.1194 BPB
+- Date: pending (waiting for RunPod credits, applied 2026-03-28)
+- Hypothesis: SOTA code (PR #549) runs as documented and produces ~1.1194 BPB
 - Change: none — run SOTA train_gpt.py unmodified
-- Hardware: 8xH100 (required for Parallel Muon)
+- Hardware: 8xH100 SXM (required for Parallel Muon)
+- Command: `SEED=1337 RUN_ID=exp0_reproduce torchrun --standalone --nproc_per_node=8 records/track_10min_16mb/2026-03-23_LeakyReLU_LegalTTT_ParallelMuon/train_gpt.py`
+- Expected: val_bpb ~1.1192 (seed 1337 from their logs), ms/step ~83, artifact ~15.97MB
 - Result: pending
 - Verdict: pending
-- Cost: ~$3.30 (10 min training + 10 min eval)
+- Cost: ~$3.30 (10 min training + 10 min eval on 8xH100 @ ~$20/hr)
+
+## Experiment 0.5: Verify our code matches SOTA (JEPA off)
+- Date: pending
+- Hypothesis: our_train_gpt.py with JEPA_ENABLED=0 produces results identical to the original SOTA
+- Change: use our code instead of the original — should be identical since JEPA defaults to off
+- Hardware: 8xH100 SXM
+- Command: `SEED=1337 JEPA_ENABLED=0 RUN_ID=exp05_ours_baseline torchrun --standalone --nproc_per_node=8 our_train_gpt.py`
+- Expected: val_bpb ~1.1192 (same as Experiment 0)
+- Result: pending
+- Verdict: pending
+
+## Experiment 1: JEPA auxiliary loss
+- Date: pending
+- Hypothesis: the JEPA auxiliary loss improves BPB by acting as a regularizer (PR #1006 claims it contributes to 1.1085). The extra ~130K params + forward compute for the JEPA loss should add <5ms/step.
+- Change: JEPA_ENABLED=1 (latent_dim=256, loss_weight=0.12, future_spans=1,2,4,8, ema_decay=0.996)
+- Hardware: 8xH100 SXM
+- Command: `SEED=1337 JEPA_ENABLED=1 RUN_ID=exp1_jepa torchrun --standalone --nproc_per_node=8 our_train_gpt.py`
+- Success criteria: BPB improves by >0.001 AND ms/step increases by <5ms
+- Fail criteria: BPB worsens OR ms/step increases by >5ms (net negative per Rule 1)
+- Result: pending
+- Verdict: pending
+
+## Experiment 2: Full Hessian GPTQ (planned, not yet coded)
+- Date: pending
+- Hypothesis: full GPTQ (Frantar et al.) gives better int6 quantization than GPTQ-lite by compensating per-column rounding error using the inverse Hessian. PR #1006 reports a 13s runtime.
+- Change: replace GPTQ-lite quantization with full Hessian GPTQ + 128-batch calibration
+- Hardware: 8xH100 SXM
+- Success criteria: post-quantization BPB improves (lower quantization penalty)
+- Code status: NOT YET IMPLEMENTED. Reference implementation in PR #1006.
+- Result: pending
+
+## Experiment 3: AdamW TTT pre-quantization (planned, not yet coded)
+- Date: pending
+- Hypothesis: AdamW TTT on full-precision EMA weights before quantization gives better adaptation than SGD TTT on dequantized weights. PR #1006 found SGD fails on CastedLinear.
+- Change: replace SGD TTT with AdamW + cosine decay, run before GPTQ instead of after
+- Hardware: 8xH100 SXM
+- Success criteria: TTT BPB gain > the current 0.0025 from SOTA's SGD TTT
+- Code status: NOT YET IMPLEMENTED. Reference implementation in PR #1006.
+- Result: pending
+
+---
+
+## Research Notes (no experiment needed)
+
+### 8192 Vocab Analysis (2026-03-28)
+- INVESTIGATED: can we fit an 8192 vocab in the SOTA's 16MB budget?
+- ANSWER: no. The SOTA artifact has only 9,994 bytes of headroom; even factored 8192×64 + 64×512 needs +25KB.
+- Only viable with ternary quantization (4x more compact), which is a completely different codebase.
+- DECISION: stay on the int6/GPTQ path. Don't pursue 8192 vocab.
+
+### Depth Recurrence (2026-03-28)
+- CONFIRMED DEAD END by 3 independent teams (PR #363, PR #499, ternary author)
+- Two structural taxes: quantization compounding + step-time overhead = 0.025 BPB worse
+- Do not attempt.
+
+### Novel Architectures (2026-03-28, PR #831)
+- 6 architectures tested; all failed at the 16MB/600s constraint
+- SOTA stack is co-optimized: Parallel Muon + torch.compile + int6 + H100 tensor cores
+- Throughput tax: must improve BPB by 0.007 per ms/step of overhead
+- SSM hybrid (Mamba) is 3.4x slower — completely unviable
