
Commit 3e058d4

george11642 and claude committed
feat: Phase 1 submission — Legal TTT + XSA-all + Meta-TTT (FOMAML)
Non-record submission building on PR openai#462's architecture with: - XSA on all 11 layers (was 4) - Cosine TTT 30 epochs with per-layer LR groups - GPTQ-lite optimal clip percentile search - Legal score-first TTT protocol - Meta-TTT (FOMAML) in development Awaiting compute for validation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0f51451 commit 3e058d4

3 files changed

Lines changed: 1653 additions & 0 deletions

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# Phase 1: Legal Score-First TTT + Meta-TTT (FOMAML)

**Author:** George Teifel (@george11642)
**Status:** Awaiting compute for validation
**Base:** PR #462 architecture (1.0672 BPB)
**Track:** non_record_16mb

## Approach

Building on PR #462's GEPA-discovered architecture (Star-ReLU + U-Net + XSA + AdamW TTT), this submission adds:

### Proven Techniques (Phase 1)

1. **XSA on all 11 layers** (PR #462 uses only the last 4). PR #478 shows that XSA on all layers improves BPB.
2. **Cosine TTT scheduling with 30 epochs** (PR #462 uses 10). PR #481 shows that cosine scheduling plus more epochs helps.
3. **Per-layer TTT learning-rate groups**: 3x LR for MLP output projections (highest quantization error), 0.5x for input projections, 1x for everything else. Based on PR #481's quantization-damage analysis.
4. **GPTQ-lite optimal clip percentile search**: per-row sweep of 6 percentile candidates to minimize reconstruction error before final quantization.
5. **Legal score-first TTT**: evaluate tokens BEFORE training on them, complying strictly with the rule "you are only allowed to test-time train on validation set tokens you've already evaluated your model on."
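Items 2 and 3 above can be sketched with standard PyTorch parameter groups plus a cosine scheduler. A minimal sketch, assuming a toy MLP stand-in for the real model; the module layout, base learning rate, and name matching are illustrative, not the submission's actual code:

```python
import torch
import torch.nn as nn

model = nn.Sequential(              # stand-in for the real 11-layer model
    nn.Linear(512, 1792), nn.ReLU(), nn.Linear(1792, 512)
)

base_lr = 1e-4                      # assumed base TTT learning rate
out_proj, in_proj, rest = [], [], []
for name, p in model.named_parameters():
    if name.startswith("2"):        # illustrative match: MLP output projections
        out_proj.append(p)
    elif name.startswith("0"):      # illustrative match: MLP input projections
        in_proj.append(p)
    else:
        rest.append(p)

opt = torch.optim.AdamW([
    {"params": out_proj, "lr": 3.0 * base_lr},  # 3x: highest quantization error
    {"params": in_proj,  "lr": 0.5 * base_lr},  # 0.5x: input projections
    {"params": rest,     "lr": base_lr},        # 1x: everything else
])
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=30)

for epoch in range(30):             # 30 TTT epochs with cosine decay
    # ...one pass over the already-scored validation tokens would go here...
    opt.step()
    sched.step()
```

Grouping by parameter name keeps the 3x/0.5x/1x ratios fixed while the cosine schedule scales all three groups down together.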
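Item 4's per-row clip search can be sketched in NumPy. The candidate percentiles and the symmetric int6 quantizer here are assumptions for illustration; only the search structure (6 candidates, per-row MSE minimization) comes from the text above:

```python
import numpy as np

CANDIDATES = [99.0, 99.5, 99.7, 99.9, 99.99, 100.0]  # assumed 6 candidates
LEVELS = 2 ** 6                                       # int6

def quantize_row(row, pct):
    """Symmetric int6 round-trip with clipping at the given |w| percentile."""
    clip = max(np.percentile(np.abs(row), pct), 1e-12)
    scale = clip / (LEVELS / 2 - 1)
    q = np.clip(np.round(row / scale), -(LEVELS // 2), LEVELS // 2 - 1)
    return q * scale                                  # dequantized reconstruction

def best_clip_per_row(W):
    """For each weight row, pick the percentile with lowest reconstruction MSE."""
    best = np.empty(W.shape[0])
    for i, row in enumerate(W):
        errs = [np.mean((row - quantize_row(row, p)) ** 2) for p in CANDIDATES]
        best[i] = CANDIDATES[int(np.argmin(errs))]
    return best

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 512))       # toy weight matrix
chosen = best_clip_per_row(W)       # one chosen percentile per row
```

Clipping below 100% trades a little range error on outliers for finer resolution on the bulk of each row, which is why the optimum varies per row.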
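The score-first protocol of item 5 reduces to a strict ordering inside the evaluation loop: every chunk is scored by a model that has never trained on it, and only then used for adaptation. A minimal sketch with stand-in `score`/`train_step` functions (the real ones would compute LM loss and take a TTT gradient step):

```python
def score(model, chunk):
    # stand-in: total bits for the chunk at the model's current quality
    return len(chunk) * model["bpb"]

def train_step(model, chunk):
    # stand-in for one TTT update; just nudges a fake BPB downward
    model["bpb"] *= 0.999

def legal_ttt_eval(model, chunks):
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        total_bits += score(model, chunk)   # 1) evaluate first (this counts)
        total_tokens += len(chunk)
        train_step(model, chunk)            # 2) only then adapt on it
    return total_bits / total_tokens        # reported BPB

model = {"bpb": 1.07}
chunks = [list(range(100))] * 5
bpb = legal_ttt_eval(model, chunks)         # slightly below the 1.07 start
```

Because scoring precedes training within each chunk, the first chunk is always scored at the un-adapted baseline and later chunks benefit from adaptation on earlier ones, which is exactly what the quoted rule permits.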
### Novel Technique (Phase 2 - In Development)

6. **Meta-TTT (FOMAML)**: during training, periodically simulate the quantize-TTT-eval pipeline via first-order MAML. This teaches the model to produce weight configurations that are maximally adaptable during test-time training. No existing submission uses meta-learning for TTT optimization.
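One FOMAML outer step over the simulated pipeline could look like the sketch below. The fake quantizer, the MSE placeholder loss, and all hyperparameters are assumptions; the FOMAML-specific part is that the post-adaptation gradient is copied straight onto the original weights, with no second-order terms:

```python
import copy
import torch
import torch.nn as nn

def fake_quant(model, bits=6):
    """Placeholder int-N round-trip on weights (simulates quantization damage)."""
    with torch.no_grad():
        for p in model.parameters():
            scale = p.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
            p.copy_(torch.round(p / scale) * scale)

def fomaml_step(model, batch, inner_lr=1e-3, inner_steps=3):
    adapted = copy.deepcopy(model)              # simulate the deployed copy
    fake_quant(adapted)                         # 1) quantize
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                # 2) simulated TTT
        loss = nn.functional.mse_loss(adapted(batch), batch)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    inner_opt.zero_grad()
    eval_loss = nn.functional.mse_loss(adapted(batch), batch)  # 3) eval
    eval_loss.backward()                        # grads live on `adapted`
    # FOMAML: copy post-adaptation gradients onto the original parameters
    for p, pa in zip(model.parameters(), adapted.parameters()):
        p.grad = pa.grad.clone()
    return eval_loss.item()

model = nn.Linear(8, 8)
batch = torch.randn(16, 8)
loss = fomaml_step(model, batch)
# an outer optimizer would now call .step() on `model` using these grads
```

Dropping the second-order terms is what makes this "first-order": the outer update treats the adapted weights as a constant function of the originals, which is far cheaper than full MAML and usually close in practice.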
## Architecture

- 11 layers (5 encoder + 6 decoder, U-Net gated skips)
- dim=512, heads=8/8, MLP hidden=1792
- Star-ReLU activation with learned scale/bias
- BigramHash (8192 buckets, 128d), SmearGate
- Partial RoPE (16/64 dims), XSA on all layers
- Int6 QAT with GPTQ-lite clip search, zstd-22
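The list above could be collected into a single config object; a minimal sketch with illustrative field names (not the submission's actual hyperparameter names):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int = 11          # 5 encoder + 6 decoder, U-Net gated skips
    dim: int = 512
    n_heads: int = 8            # 8/8 encoder/decoder heads
    mlp_hidden: int = 1792
    bigram_buckets: int = 8192
    bigram_dim: int = 128
    rope_dims: int = 16         # partial RoPE: 16 of 64 head dims
    xsa_layers: int = 11        # XSA on all layers
    quant_bits: int = 6         # int6 QAT + GPTQ-lite clip search
    zstd_level: int = 22

cfg = ModelConfig()
```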
## Expected Results

Awaiting 8xH100 compute to validate. Expected improvement over the PR #462 baseline (1.0672 BPB):

- Phase 1 techniques: ~-0.012 BPB
- Meta-TTT (if successful): ~-0.015 BPB additional

## Confirmed Dead Ends (Not Attempted)

- Depth recurrence (PR #386: 1.4061 BPB)
- MoE (sparsity=0 optimal below 500M params)
- LoRA TTT (10x worse than full-param TTT)
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
{
  "name": "George Teifel",
  "github_id": "george11642",
  "val_bpb": null,
  "track": "non_record_16mb",
  "metadata": {
    "description": "Phase 1: Legal score-first TTT + XSA-all + per-layer TTT LR + GPTQ-lite clip search. Building toward meta-TTT (MAML). Based on PR #462 architecture.",
    "techniques": [
      "Star-ReLU + U-Net + BigramHash + SmearGate (from PR #462)",
      "XSA on all 11 layers (extended from XSA-4)",
      "Cosine TTT 30 epochs with per-layer LR groups",
      "GPTQ-lite optimal clip percentile search",
      "Legal score-first TTT protocol",
      "Meta-TTT (FOMAML) - in development"
    ],
    "base": "PR #462 (1.0672 BPB)",
    "status": "awaiting_compute_for_validation"
  }
}
