records/track_non_record_16mb/gteifel_phase1_meta_ttt/README.md (40 additions, 0 deletions)
@@ -0,0 +1,40 @@
# Phase 1: Legal Score-First TTT + Meta-TTT (FOMAML)

**Author:** George Teifel (@george11642)
**Status:** Awaiting compute for validation
**Base:** PR #462 architecture (1.0672 BPB)
**Track:** non_record_16mb

## Approach

Building on PR #462's GEPA-discovered architecture (Star-ReLU + U-Net + XSA + AdamW TTT), this submission adds:

### Proven Techniques (Phase 1)
1. **XSA on all 11 layers** (PR #462 uses last 4 only). PR #478 shows XSA-all improves BPB.
2. **Cosine TTT scheduling with 30 epochs** (PR #462 uses 10). PR #481 shows cosine + more epochs helps.
3. **Per-layer TTT learning rate groups**: 3x LR for MLP output projections (highest quantization error), 0.5x for input projections, 1x for everything else. Based on PR #481's quantization damage analysis.
4. **GPTQ-lite optimal clip percentile search**: Per-row sweep of 6 percentile candidates to minimize reconstruction error before final quantization.
5. **Legal score-first TTT**: Evaluate tokens BEFORE training on them, complying strictly with the rule "you are only allowed to test-time train on validation set tokens you've already evaluated your model on."
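Items 2 and 3 can be sketched together: the parameters are split into three learning-rate groups and the resulting TTT LR is decayed with a cosine schedule over 30 epochs. The substring matches (`out_proj`, `in_proj`) are assumptions about the PR #462 parameter names, not this submission's actual code; the returned groups can be fed directly to `torch.optim.AdamW`.

```python
import math

def group_ttt_params(named_params, base_lr=1e-4):
    """Split (name, param) pairs into the three per-layer TTT LR groups."""
    out_proj, in_proj, rest = [], [], []
    for name, p in named_params:
        if "mlp" in name and "out_proj" in name:
            out_proj.append(p)          # highest quantization error: 3x LR
        elif "mlp" in name and "in_proj" in name:
            in_proj.append(p)           # input projections: 0.5x LR
        else:
            rest.append(p)              # everything else: 1x LR
    return [
        {"params": out_proj, "lr": 3.0 * base_lr},
        {"params": in_proj, "lr": 0.5 * base_lr},
        {"params": rest, "lr": 1.0 * base_lr},
    ]

def cosine_ttt_lr(epoch, total_epochs=30):
    """Cosine decay factor for the TTT LR, from 1.0 down to 0.0."""
    return 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

Usage would look like `torch.optim.AdamW(group_ttt_params(model.named_parameters()))`, with each group's LR rescaled by `cosine_ttt_lr(epoch)` at the start of every TTT epoch.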
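A minimal sketch of the GPTQ-lite clip search (item 4): for each weight row, simulate int6 quantization at several clip percentiles and keep the one with the lowest reconstruction error. The candidate list and the MSE error metric here are assumptions for illustration; the submission's exact six candidates are not stated.

```python
import numpy as np

CANDIDATES = [99.0, 99.5, 99.9, 99.99, 99.999, 100.0]  # hypothetical percentiles

def quantize_row_int6(row, clip):
    """Symmetric int6 fake-quantization of one row at a given clip value."""
    scale = clip / 31.0                          # int6 range is [-32, 31]
    q = np.clip(np.round(row / scale), -32, 31)
    return q * scale

def best_clip_per_row(W):
    """Per-row sweep: pick the clip percentile minimizing reconstruction MSE."""
    out = np.empty_like(W)
    for i, row in enumerate(W):
        best_err, best_rec = np.inf, row
        for pct in CANDIDATES:
            clip = np.percentile(np.abs(row), pct)
            if clip == 0:
                continue                         # all-zero row: nothing to do
            rec = quantize_row_int6(np.clip(row, -clip, clip), clip)
            err = np.mean((row - rec) ** 2)
            if err < best_err:
                best_err, best_rec = err, rec
        out[i] = best_rec
    return out
```

Because 100.0 (no clipping) is always among the candidates, the sweep can never do worse than naive max-abs quantization.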
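The score-first protocol (item 5) reduces to an ordering constraint: every validation chunk is scored with the current weights first, and only then used for a TTT update. `score_bpb` and `ttt_update` below are hypothetical stand-ins for the real evaluation and AdamW-TTT steps.

```python
def score_first_ttt(model, val_chunks, score_bpb, ttt_update):
    """Evaluate each chunk BEFORE training on it; return overall BPB."""
    total_bits, total_tokens = 0.0, 0
    for chunk in val_chunks:
        bits, n = score_bpb(model, chunk)   # 1) score with pre-update weights
        total_bits += bits
        total_tokens += n
        ttt_update(model, chunk)            # 2) only now adapt on the chunk
    return total_bits / total_tokens
```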

### Novel Technique (Phase 2 - In Development)
6. **Meta-TTT (FOMAML)**: During training, periodically simulate the quantize-TTT-eval pipeline via first-order MAML. This teaches the model to produce weight configurations that are maximally adaptable during test-time training. No existing submission uses meta-learning for TTT optimization.
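The meta-step above can be sketched as a first-order MAML update: adapt a quantized clone with a few TTT steps, evaluate it, and copy the post-adaptation gradients back onto the training weights (FOMAML drops the second-order term). `fake_quantize`, `ttt_loss`, and `eval_loss` are hypothetical stand-ins for the submission's actual pipeline.

```python
import copy
import torch

def fomaml_meta_step(model, meta_opt, ttt_batch, eval_batch,
                     fake_quantize, ttt_loss, eval_loss,
                     inner_steps=3, inner_lr=1e-4):
    # Inner loop: simulate the deploy-time pipeline on a clone.
    adapted = copy.deepcopy(model)
    fake_quantize(adapted)                       # simulate int6 damage
    inner_opt = torch.optim.AdamW(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        ttt_loss(adapted, ttt_batch).backward()
        inner_opt.step()
    # Outer step: evaluate the adapted clone, then apply its gradients
    # to the original parameters (the first-order approximation).
    adapted.zero_grad(set_to_none=True)
    eval_loss(adapted, eval_batch).backward()
    for p, p_adapted in zip(model.parameters(), adapted.parameters()):
        p.grad = p_adapted.grad.detach().clone()
    meta_opt.step()
```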

## Architecture

- 11 layers (5 encoder + 6 decoder, U-Net gated skips)
- dim=512, heads=8/8, MLP hidden=1792
- Star-ReLU activation with learned scale/bias
- BigramHash (8192 buckets, 128d), SmearGate
- Partial RoPE (16/64 dims), XSA on all layers
- Int6 QAT with GPTQ-lite clip search, zstd-22
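The summary above, collected into a single config object for reference (field names are illustrative, not the submission's actual code):

```python
from dataclasses import dataclass

@dataclass
class Phase1Config:
    n_layers: int = 11              # 5 encoder + 6 decoder, U-Net gated skips
    dim: int = 512
    n_heads: int = 8
    n_kv_heads: int = 8
    mlp_hidden: int = 1792
    activation: str = "star_relu"   # learned scale/bias
    bigram_hash_buckets: int = 8192
    bigram_hash_dim: int = 128
    smear_gate: bool = True
    rope_dims: int = 16             # partial RoPE over 16 of 64 head dims
    xsa_layers: str = "all"         # XSA on every layer, not just the last 4
    quant_bits: int = 6             # int6 QAT with GPTQ-lite clip search
    compressor: str = "zstd-22"
```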

## Expected Results

Awaiting 8xH100 compute for validation. Expected improvement over the PR #462 baseline (1.0672 BPB):
- Phase 1 techniques: ~-0.012 BPB (i.e., roughly 1.055)
- Meta-TTT (if successful): a further ~-0.015 BPB (roughly 1.040)

## Confirmed Dead Ends (Not Attempted)
- Depth recurrence (PR #386: 1.4061 BPB)
- MoE (sparsity=0 optimal below 500M params)
- LoRA TTT (10x worse than full-param TTT)
@@ -0,0 +1,19 @@
{
  "name": "George Teifel",
  "github_id": "george11642",
  "val_bpb": null,
  "track": "non_record_16mb",
  "metadata": {
    "description": "Phase 1: Legal score-first TTT + XSA-all + per-layer TTT LR + GPTQ-lite clip search. Building toward meta-TTT (MAML). Based on PR #462 architecture.",
    "techniques": [
      "Star-ReLU + U-Net + BigramHash + SmearGate (from PR #462)",
      "XSA on all 11 layers (extended from XSA-4)",
      "Cosine TTT 30 epochs with per-layer LR groups",
      "GPTQ-lite optimal clip percentile search",
      "Legal score-first TTT protocol",
      "Meta-TTT (FOMAML) - in development"
    ],
    "base": "PR #462 (1.0672 BPB)",
    "status": "awaiting_compute_for_validation"
  }
}