- Extra params: ~130K (3 LatentProjector modules + span embeddings)
- **STATUS: Code added to our_train_gpt.py, toggled via JEPA_ENABLED=1**

**2. Full Hessian GPTQ (replaces GPTQ-lite)**
- Collects H = XᵀX via forward hooks during 128-batch calibration
- Per-column rounding error is compensated using the inverse Hessian
- Block size 128, percdamp 0.01
- Takes 13 seconds — fits in the eval budget
- **STATUS: Not yet implemented in our code. Reference: PR #1006 train_gpt.py**
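The per-column error-compensation step can be sketched as follows — a minimal numpy illustration of the GPTQ idea, not the PR #1006 implementation (block processing, Cholesky factorization, and per-group scales are omitted; function names are hypothetical):

```python
import numpy as np

def collect_hessian(X):
    """Accumulate H = X^T X from calibration activations (rows = samples)."""
    return X.T @ X

def gptq_quantize_column_wise(W, H, bits=6, percdamp=0.01):
    """Quantize W one column at a time, spreading each column's rounding
    error onto the not-yet-quantized columns via the inverse Hessian."""
    W = W.copy().astype(np.float64)
    d = W.shape[1]
    # Dampen the Hessian diagonal for numerical stability (percdamp).
    damp = percdamp * np.mean(np.diag(H))
    Hinv = np.linalg.inv(H + damp * np.eye(d))
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax  # one shared scale, for simplicity
    Q = np.zeros_like(W)
    for i in range(d):
        col = W[:, i]
        q = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        Q[:, i] = q
        # GPTQ update: push the rounding error onto the remaining columns.
        err = (col - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q, scale
```

With the compensation step, the quantization error measured on the calibration activations is typically lower than naive round-to-nearest.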
**3. AdamW TTT Pre-Quantization**
- Key finding: SGD-based TTT fails on CastedLinear architectures
- Fix: AdamW with cosine decay on the EMA-averaged model BEFORE quantization
- GPTQ then quantizes the adapted weights
- **STATUS: Not yet implemented. Reference: PR #1006 train_gpt.py**
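A minimal sketch of the intended ordering (AdamW + cosine decay on the full-precision weights, then hand the adapted model to quantization); the function name `adamw_ttt` and all hyperparameters here are assumptions, not PR #1006's values:

```python
import torch

def adamw_ttt(model, calib_batches, lr=1e-4, steps=None):
    """Test-time training pass run BEFORE quantization: AdamW with cosine
    decay on the full-precision (EMA-averaged) weights, so GPTQ later
    quantizes the adapted weights."""
    steps = steps or len(calib_batches)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    model.train()
    for step, (x, y) in enumerate(calib_batches):
        if step >= steps:
            break
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()  # cosine decay to ~0 over the TTT budget
    return model
```

Quantization (e.g. the GPTQ pass) would then run on `model`'s adapted weights.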
### Key Research Findings (PR #831)
**The throughput tax formula:** Any technique must improve BPB by 0.007 per millisecond of step-time overhead. At the 83 ms/step baseline, each extra 1 ms costs ~7 training steps, and each step is worth ≈ 0.001 BPB.

**Takeaway:** The SOTA stack is co-optimized (Parallel Muon + torch.compile + int6 + tensor cores); breaking any one pillar cascades. JEPA is one of the few additions that doesn't break this co-optimization because it is only active during training (no eval cost).
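The rule of thumb above can be encoded directly (constants are taken from the notes; the helper name is illustrative):

```python
def required_bpb_gain(overhead_ms, steps_lost_per_ms=7, bpb_per_step=0.001):
    """Throughput tax: BPB a technique must gain to pay for its step-time
    overhead (~7 steps lost per ms of overhead, ~0.001 BPB per step)."""
    return overhead_ms * steps_lost_per_ms * bpb_per_step
```

For example, a technique that adds 5 ms/step must buy at least `required_bpb_gain(5)` = 0.035 BPB to break even.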
## Codebase Layout
- **RunPod credits:** Applied 2026-03-28, approval in 1-2 business days. Quick-start tier (~$25 / ~8 compute hours). Can also use Saiful's existing RunPod account.
- **MLX validation is slow:** ~20 min for a full val pass on Mac. Training is fine for directional testing.
- **our_train_gpt.py requires CUDA + flash_attn_interface:** cannot be tested locally. The MLX script is separate.
- Expected: val_bpb ~1.1192 (same as Experiment 0)
- Result: pending
- Verdict: pending

## Experiment 1: JEPA auxiliary loss
- Date: pending
- Hypothesis: The JEPA auxiliary loss improves BPB by acting as a regularizer (PR #1006 claims it contributes to 1.1085). The extra ~130K params + forward compute for the JEPA loss should add <5 ms/step.
- Code status: NOT YET IMPLEMENTED. Reference implementation in PR #1006.
- Result: pending
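A minimal sketch of how such an auxiliary loss could be wired in. The `LatentProjector` name is borrowed from the notes, but the head design, span offset, and loss form here are illustrative assumptions, not PR #1006's implementation:

```python
import torch
import torch.nn.functional as F

class LatentProjector(torch.nn.Module):
    """Small projection head from model space to a latent space."""
    def __init__(self, d_model, d_latent):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, d_latent)

    def forward(self, h):
        return self.proj(h)

def jepa_aux_loss(h, context_proj, target_proj, predictor, span=4):
    """JEPA-style auxiliary loss on hidden states h (batch, seq, d_model):
    predict the latent of the position `span` tokens ahead from the current
    position's latent; the target branch is stop-gradient."""
    pred = predictor(context_proj(h[:, :-span]))  # predicted future latents
    with torch.no_grad():
        tgt = target_proj(h[:, span:])            # target latents (no grad)
    return F.mse_loss(pred, tgt)
```

Training would then use `loss = ce_loss + lam * jepa_aux_loss(...)`; since the extra branch is only computed during training, eval cost is unchanged.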
## Experiment 3: AdamW TTT pre-quantization (planned, not yet coded)
- Date: pending
- Hypothesis: AdamW TTT on full-precision EMA weights before quantization gives better adaptation than SGD TTT on dequantized weights. PR #1006 found SGD fails on CastedLinear.
- Change: Replace SGD TTT with AdamW + cosine decay, run before GPTQ instead of after
- Hardware: 8xH100 SXM
- Success criteria: TTT BPB gain > the current 0.0025 from SOTA's SGD TTT
- Code status: NOT YET IMPLEMENTED. Reference implementation in PR #1006.
- Result: pending

---
## Research Notes (no experiment needed)
### 8192 Vocab Analysis (2026-03-28)
- INVESTIGATED: Can we fit an 8192 vocab in the SOTA's 16MB budget?
- ANSWER: No. The SOTA artifact has only 9,994 bytes of headroom. Even a factored 8192×64 + 64×512 embedding needs +25KB.
- Only viable with ternary quantization (4x more compact), which is a completely different codebase.
- DECISION: Stay on the int6/GPTQ path. Don't pursue 8192 vocab.
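As a sanity check on sizes, the raw footprint of a factored embedding is easy to compute. Illustrative helper only — the notes' +25KB figure is a delta against the current embedding, which this sketch does not model:

```python
def factored_embed_bytes(vocab, d_inner, d_model, bits=6):
    """Bytes for a factored embedding: a vocab x d_inner table followed by
    a d_inner x d_model projection, stored at `bits` bits per weight."""
    params = vocab * d_inner + d_inner * d_model
    return params * bits / 8
```

For example, `factored_embed_bytes(8192, 64, 512)` gives 417,792 bytes (~408 KiB) at int6 — far above the 9,994-byte headroom.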
### Depth Recurrence (2026-03-28)
- CONFIRMED DEAD END by 3 independent teams (PR #363, PR #499, ternary author)
- Two structural taxes: quantization compounding + step-time overhead = 0.025 BPB worse
- Do not attempt.

### Novel Architectures (2026-03-28, PR #831)
- 6 architectures tested; all failed at the 16MB/600s constraint
- SOTA stack is co-optimized: Parallel Muon + torch.compile + int6 + H100 tensor cores
- Throughput tax: must improve BPB by 0.007 per ms/step of overhead
- SSM hybrid (Mamba) is 3.4x slower — completely unviable